Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, alignment with participant profiles, and success criteria.
- High-level migration approaches and their associated risks.
- Setting up workspaces, repositories, and lab datasets.
Day 1 — Migration Fundamentals and Architecture
- Core concepts of the Lakehouse, an overview of Delta Lake, and Databricks architecture.
- Differences between SMP and MPP architectures and their implications for migration.
- Designing the Medallion (Bronze→Silver→Gold) pattern and an overview of Unity Catalog.
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure into a notebook.
- Mapping temporary tables and cursors to DataFrame transformations.
- Validation and comparison with the original output.
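The validation step in this lab can be sketched as an order-insensitive comparison of two result sets. The example below is a minimal sketch in plain Python, with tuples standing in for rows collected from the stored procedure and from the notebook. The function name `diff_result_sets` and the sample rows are hypothetical and not part of the course material.

```python
from collections import Counter


def diff_result_sets(legacy_rows, migrated_rows):
    """Compare two result sets as multisets, ignoring row order.

    Returns (missing, unexpected): rows found only in the legacy output and
    rows found only in the migrated output, each mapped to its surplus count.
    """
    legacy = Counter(map(tuple, legacy_rows))
    migrated = Counter(map(tuple, migrated_rows))
    missing = legacy - migrated      # Counter subtraction keeps positive counts only
    unexpected = migrated - legacy
    return dict(missing), dict(unexpected)


# Hypothetical sample data: the migrated output lost one duplicate row.
legacy_output = [(1, "open", 120.0), (2, "closed", 75.5), (2, "closed", 75.5)]
migrated_output = [(2, "closed", 75.5), (1, "open", 120.0)]

missing, unexpected = diff_result_sets(legacy_output, migrated_output)
print("missing from migrated output:", missing)
print("unexpected in migrated output:", unexpected)
```

Comparing as multisets rather than sets matters here because stored procedures often return duplicate rows, and a set comparison would hide a dropped duplicate.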
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, commit logs, versioning, and time travel capabilities.
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution.
- OPTIMIZE, VACUUM, Z-ORDER, partitioning, and storage tuning.
Day 2 Lab — Incremental Ingestion & Optimization
- Implementing Auto Loader ingestion and MERGE workflows.
- Applying OPTIMIZE, Z-ORDER, and VACUUM; validating results.
- Measuring improvements in read and write performance.
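The MERGE workflow in this lab follows upsert semantics: rows whose key already exists in the target are updated, and the rest are inserted. The sketch below illustrates only that rule in plain Python, with dictionaries standing in for the target table and the incoming batch. It is not the Delta Lake API, the function `upsert` and the key column `id` are hypothetical, and it assumes the batch has already been deduplicated by key.

```python
def upsert(target, batch, key="id"):
    """Apply a batch to a keyed target: update matching rows, insert the rest.

    Returns a new dict plus counts of updated and inserted rows.
    The inputs are left unmodified. Assumes batch keys are unique.
    """
    merged = {k: dict(v) for k, v in target.items()}  # copy rows, not references
    updated = inserted = 0
    for row in batch:
        k = row[key]
        if k in merged:
            merged[k].update(row)   # matched: overwrite the supplied columns
            updated += 1
        else:
            merged[k] = dict(row)   # not matched: insert as a new row
            inserted += 1
    return merged, updated, inserted


# Hypothetical sample data.
target = {1: {"id": 1, "status": "open"}}
batch = [{"id": 1, "status": "closed"}, {"id": 2, "status": "open"}]

merged, updated, inserted = upsert(target, batch)
print(merged, updated, inserted)
```

Returning the update and insert counts gives the lab a simple number to check against the size of the incoming batch when validating results.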
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, and JSON/array handling.
- Reading the Spark UI, DAGs, shuffles, stages, tasks, and diagnosing bottlenecks.
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction.
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactoring a complex SQL process into optimized Spark SQL.
- Using Spark UI traces to identify and fix skew and shuffle issues.
- Benchmarking before-and-after results and documenting tuning steps.
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies.
- Transforming loops and cursors into vectorized DataFrame operations.
- Modularization, UDFs/pandas UDFs, widgets, and reusable libraries.
Day 4 Lab — Refactoring Procedural Scripts
- Refactoring a procedural ETL script into modular PySpark notebooks.
- Introducing parametrization, unit-style tests, and reusable functions.
- Conducting code reviews and applying best-practice checklists.
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling.
- Designing incremental Medallion pipelines with quality rules and schema validation.
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic.
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assembling a Bronze→Silver→Gold pipeline orchestrated with Workflows.
- Implementing logging, auditing, retries, and automated validations.
- Running the full pipeline, validating outputs, and preparing deployment notes.
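The logging and retry items in this lab can be sketched as a small wrapper around a pipeline step. This is a minimal illustration in plain Python, assuming exponential backoff between attempts. The names `run_with_retries` and `flaky_step` are hypothetical, and the `sleep` parameter exists only so that the delay can be replaced in tests. Workflows tasks can also be given their own retry settings, so a wrapper like this would apply to retries inside a single task.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_with_retries(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `step()` until it succeeds or `max_attempts` is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last exception once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            logger.info("step succeeded on attempt %d", attempt)
            return result
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical step that fails twice before succeeding.
calls = {"n": 0}


def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = run_with_retries(flaky_step, sleep=delays.append)
print(result, delays)
```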
Operationalization, Governance, and Production Readiness
- Unity Catalog governance, lineage, and access-control best practices.
- Cost management, cluster sizing, autoscaling, and job concurrency patterns.
- Deployment checklists, rollback strategies, and runbook creation.
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned.
- Gap analysis, recommended follow-up activities, and training materials handoff.
- References, further learning paths, and support options.
Requirements
- A solid understanding of data engineering concepts.
- Experience with SQL and stored procedures (particularly in Synapse or SQL Server).
- Familiarity with ETL orchestration concepts (such as Azure Data Factory or similar tools).
Audience
- Technology managers with a data engineering background.
- Data engineers transitioning procedural OLAP logic to Lakehouse patterns.
- Platform engineers tasked with Databricks adoption.