Course Outline

PySpark & Machine Learning 

Module 1: Big Data & Spark Foundations

  • An overview of the Big Data ecosystem and Spark's role within modern data platforms
  • Exploring Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAGs, and execution planning
  • Distinguishing between RDD and DataFrame APIs and identifying when to utilise each approach
  • Creating and configuring a SparkSession, and understanding the fundamentals of application configuration

Module 2: PySpark DataFrames

  • Reading data from and writing data to enterprise sources in common formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
  • Executing advanced operations such as window functions, managing timestamps, and handling nested data
  • Implementing data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Grasping performance fundamentals: partitioning strategies, shuffle behaviour, caching, and persistence
  • Utilising optimisation techniques such as broadcast joins and execution plan analysis
  • Efficiently processing large datasets and adhering to best practices for scalable data workflows
  • Comprehending schema evolution and modern storage formats employed in enterprise environments

Module 4: Feature Engineering at Scale

  • Conducting feature engineering with Spark MLlib: managing missing values, encoding categorical variables, and scaling features
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and to techniques for addressing imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results within distributed Machine Learning workflows

Module 6: End-to-End ML Pipelines

  • Constructing end-to-end Machine Learning pipelines that integrate preprocessing, feature engineering, and modelling
  • Applying train/validation/test split strategies
  • Conducting cross-validation and hyperparameter tuning using grid search and random search
  • Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying suitable evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting, and making informed decisions regarding model selection
  • Interpreting feature importance and gaining insight into model behaviour

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle within enterprise environments
  • Introduction to versioning, experiment tracking concepts, and basic testing strategies

 

Practical Outcome

  • Competence in working autonomously with PySpark
  • Capability to process large datasets efficiently
  • Skill in performing feature engineering at scale
  • Ability to build scalable Machine Learning pipelines

Requirements

Participants should possess the following background:

  • Foundational knowledge of Python programming, including experience with functions, data structures, and libraries
  • A basic grasp of data analysis concepts such as datasets, transformations, and aggregations
  • Elementary understanding of SQL and relational data principles
  • Introductory familiarity with Machine Learning concepts, including training datasets, features, and evaluation metrics
  • Familiarity with command-line environments and fundamental software development practices (recommended, not mandatory)

Prior experience with Pandas, NumPy, or comparable data processing libraries is advantageous but not required.

 21 Hours
