Course Outline

Section 1: Introduction to Hadoop

  • Hadoop history and core concepts
  • Ecosystem overview
  • Distributions
  • High-level architecture
  • Common myths about Hadoop
  • Challenges in Hadoop
  • Hardware and software requirements
  • Lab: First look at Hadoop

Section 2: HDFS

  • Design and architecture
  • Core concepts (horizontal scaling, replication, data locality, rack awareness)
  • Daemons: NameNode, Secondary NameNode, DataNode
  • Communications and heartbeats
  • Data integrity
  • Read and write paths
  • NameNode High Availability (HA) and Federation
  • Labs: Interacting with HDFS

Section 3: MapReduce

  • Concepts and architecture
  • Daemons (MRv1): JobTracker and TaskTracker
  • Phases: driver, mapper, shuffle/sort, reducer
  • MapReduce Version 1 and Version 2 (YARN)
  • Internals of MapReduce
  • Introduction to Java MapReduce programs
  • Labs: Running a sample MapReduce program
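
As a preview of the phases listed above, here is a minimal sketch of a word count in plain Java. It does not use the Hadoop API; the class name `WordCountPhases` and its methods are hypothetical and only mimic, in a single process, the roles that the driver, mapper, shuffle/sort, and reducer play in a real job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: an in-memory imitation of the MapReduce phases.
public class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle/sort phase: group values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Driver: wires the phases together over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // Prints {hadoop=1, hdfs=1, hello=2}
        System.out.println(run(Arrays.asList("hello hadoop", "hello hdfs")));
    }
}
```

In a real cluster the map and reduce steps run as separate tasks on different nodes, and the framework performs the shuffle over the network; the sketch only shows how data changes shape between phases.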

Section 4: Pig

  • Pig versus Java MapReduce
  • Pig job flow
  • Pig Latin language
  • ETL with Pig
  • Transformations and joins
  • User-defined functions (UDF)
  • Labs: Writing Pig scripts to analyze data

Section 5: Hive

  • Architecture and design
  • Data types
  • SQL support in Hive
  • Creating Hive tables and querying
  • Partitions
  • Joins
  • Text processing
  • Labs: Processing data with Hive

Section 6: HBase

  • Concepts and architecture
  • HBase versus RDBMS versus Cassandra
  • HBase Java API
  • Time series data on HBase
  • Schema design
  • Labs: Interacting with HBase using the shell; programming with the HBase Java API; schema design exercise

Requirements

  • Proficiency in Java (most programming exercises are in Java)
  • Familiarity with the Linux environment (navigating the command line and editing files with vi or nano)

Lab environment

Zero install: there is no need to install Hadoop software on your personal machine. A fully functional Hadoop cluster will be provided.

Students will need the following:

  • an SSH client (Linux and Mac come with SSH clients pre-installed; PuTTY is recommended for Windows)
  • a browser to access the cluster (Firefox is recommended)

Duration: 28 hours
