Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief introduction to Python and Scala

Fundamentals (Theory):

  • Architecture Overview
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies
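
The key idea behind the RDD topics above is that transformations (map, filter, ...) are lazy recipes, and only an action (collect, count, ...) triggers execution. A minimal plain-Python sketch of that split — a toy stand-in, not Spark's real API, with all names invented:

```python
# Conceptual sketch (plain Python, no Spark needed): transformations are
# lazy recipes; an action forces evaluation of the whole chain.

class MiniRDD:
    """Toy stand-in for an RDD: wraps an iterable and chains lazily."""

    def __init__(self, data):
        self._data = data  # may be any iterable or generator

    # --- transformations: return a new MiniRDD, compute nothing yet ---
    def map(self, f):
        return MiniRDD(f(x) for x in self._data)

    def filter(self, p):
        return MiniRDD(x for x in self._data if p(x))

    # --- actions: actually walk the data and return a result ---
    def collect(self):
        return list(self._data)

    def count(self):
        return sum(1 for _ in self._data)


numbers = MiniRDD(range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has run yet; collect() is the action that triggers the pipeline.
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Spark's real RDDs add partitioning, fault tolerance via lineage, and a far richer operator set, but the lazy-transformation/eager-action contract is the same.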

Exploring the Basics via the Databricks Environment (Hands-on Workshop):

  • Exercises utilizing the RDD API
  • Fundamental action and transformation functions
  • PairRDDs
  • Join Operations
  • Caching Strategies
  • Exercises utilizing the DataFrame API
  • Spark SQL
  • DataFrame Operations: select, filter, group, sort
  • User-Defined Functions (UDFs)
  • Exploring the Dataset API
  • Streaming Data
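
The pair-RDD join covered in the workshop matches (key, value) pairs by key and yields (key, (left_value, right_value)) for every matching combination. A plain-Python sketch of those semantics (sample data invented; Spark additionally shuffles pairs across partitions by key):

```python
# Conceptual sketch (plain Python): the semantics of a pair-RDD inner join,
# i.e. what Spark's rdd_left.join(rdd_right) produces per matching key.

from collections import defaultdict

def pair_join(left, right):
    """Inner join of two lists of (key, value) pairs, Spark-join style."""
    buckets = defaultdict(list)
    for k, v in right:
        buckets[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in buckets[k]]

orders = [("alice", 30), ("bob", 12), ("alice", 5)]
cities = [("alice", "Berlin"), ("carol", "Oslo")]
print(pair_join(orders, cities))
# [('alice', (30, 'Berlin')), ('alice', (5, 'Berlin'))]
```

Keys present on only one side ("bob", "carol") drop out, exactly as in an inner join; Spark also offers leftOuterJoin and rightOuterJoin for the other cases.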

Understanding Cloud Deployment via the AWS Environment (Hands-on Workshop):

  • Introduction to AWS Glue
  • Distinguishing between AWS EMR and AWS Glue
  • Example jobs performed in both environments
  • Analysis of advantages and disadvantages

Additional Topics:

  • Introduction to Apache Airflow for orchestration
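
At its core, orchestration with Airflow means modeling a pipeline as a DAG of tasks and running each task only after its upstream dependencies finish. A minimal sketch of that dependency ordering using only the Python standard library (task names invented; real Airflow adds scheduling, retries, and operators):

```python
# Conceptual sketch (plain Python, no Airflow required): an orchestrator
# runs DAG tasks in an order that respects upstream dependencies.

from graphlib import TopologicalSorter  # standard library, Python 3.9+

# task: {upstream tasks it depends on}
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)  # ['extract', 'transform', 'load', 'report']
```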

Requirements

Programming skills (preferably in Python or Scala)

Basic knowledge of SQL

Duration: 21 Hours
