Big Data Training Outline

This training course is for those who want to move into Big Data (Hadoop) as a career. It combines in-class training with a real-time project and is organized into two separate modules.

The first module provides a Hadoop overview: an introduction to big data strategy and why it is important to understand and use big data. It covers Hadoop as a platform for managing and gaining insights from your big data, and shows how vendors have aligned their offerings around the Open Data Platform (ODP), along with the three specialized value-add modules that sit on top of the ODP.

The second module provides an in-depth introduction to the main components of the ODP core, namely Apache Hadoop (including HDFS, YARN, and MapReduce), Apache Ambari, Apache Hive, and Apache HBase. Students will get hands-on experience with the programming languages used to load, query, and analyze data. The module also walks students through the packaging of the major Hadoop vendors (IBM, Hortonworks, and Cloudera) and how their distributions differ.
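
To give a flavor of that query work, here is a minimal sketch (not course material) of querying Hive from Java over the HiveServer2 JDBC driver; the host, port, credentials, and the sales table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: querying Hive from Java through the HiveServer2 JDBC driver.
// The host, port, credentials, and the 'sales' table are illustrative.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```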

Key Topics

  • Understand the purpose of big data and know why it is important
  • List the sources of data (data-at-rest vs. data-in-motion)
  • Describe the major components of the open-source Apache Hadoop stack
  • Manage and monitor Hadoop clusters with Apache Ambari and related components
  • Explore the Hadoop Distributed File System (HDFS) by running Hadoop commands
  • Understand the differences between Hadoop 1 (with MapReduce 1) and Hadoop 2 (with YARN and MapReduce 2)
  • Create and run basic MapReduce jobs (a word-count sketch follows this list)
  • Explain the role of coordination, management, and governance in the Hadoop ecosystem using Apache ZooKeeper
  • Explore common methods for performing data movement
  • Configure Flume for data loading
  • Move data into Hadoop from relational databases using Sqoop
  • Understand when to use the various data storage formats (flat files, CSV/delimited, SequenceFiles, etc.)
  • Review the differences between the open-source programming languages typically used with Hadoop (Pig, Hive) and for data science (Python, R)
  • Query data from Hive
  • Perform random access on data stored in HBase
  • Explore advanced concepts, including Oozie
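
As a taste of the MapReduce item above, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names are our own, and the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic word-count job: mappers emit (word, 1), reducers sum the counts.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // safe here: sums are associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```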

Additional:

  • Describe the big data offerings from IBM (BigInsights, Streams, and SPSS)
  • Utilize the various IBM BigInsights tools, including Big SQL and BigSheets, for your big data needs

Course Outline

  • Unit 1: Introduction to Big Data
    • Exercise 1: Setting up the lab environment
    • Exercise 2: Getting started with Hadoop
  • Unit 2: Open Data Platform with Apache Hadoop
    • Exercise 3: Exploring HDFS
  • Unit 3: Apache Ambari
    • Exercise 4: Managing Hadoop clusters with Apache Ambari
  • Unit 4: Hadoop Distributed File System
    • Exercise 5: File access and basic commands with HDFS (see the HDFS sketch after this outline)
  • Unit 5: MapReduce and YARN
    • Topic 1: Introduction to MapReduce based on MR1
    • Topic 2: Limitations of MR1
    • Topic 3: YARN and MR2
    • Exercise 6: Creating and coding a simple MapReduce job (possibly followed by a more complex second exercise)
  • Unit 6: Coordination, management, and governance
    • Exercise 7: Apache ZooKeeper
  • Unit 7: Data Movement
    • Exercise 8.1: Moving unstructured data into Hadoop with Flume
    • Exercise 8.2: Moving structured data (from databases) into Hadoop with Sqoop
  • Unit 8: Storing and Accessing Data
    • Topic 1: Representing data: CSV, XML, and JSON
    • Topic 2: Programming languages: Pig, Hive
    • Topic 3: NoSQL concepts
    • Topic 4: Accessing Hadoop data using HBase
      • Exercise 9: Performing CRUD operations using the HBase shell (see the HBase client sketch after this outline)
    • Topic 5: Querying Hadoop data using Hive
      • Exercise 10: Using Hive to access Hadoop/HBase data
  • Unit 9: Advanced Topics
    • Topic 1: Controlling job workflows with Oozie
    • Topic 2: Search using Elasticsearch
    • Topic 3: Apache Spark
    • Exercise 11: Working with Spark RDDs in a Spark job
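
The HDFS exercises above work through Hadoop commands on the command line; as a companion, here is a minimal sketch of the same kind of file access through Hadoop's Java FileSystem API. The paths are illustrative, not part of the lab.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of programmatic HDFS access; the paths are illustrative.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (equivalent to: hdfs dfs -put).
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/student/sample.txt"));

        // List a directory (equivalent to: hdfs dfs -ls /user/student).
        for (FileStatus status : fs.listStatus(new Path("/user/student"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```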

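Likewise, the HBase CRUD exercise runs through the HBase shell; this sketch shows the same put/get round trip through the HBase Java client API. The customers table and its info column family are hypothetical lab fixtures.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch of random access against HBase; the 'customers' table
// and its 'info' column family are hypothetical.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write one cell keyed by row id (a put).
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Ada Lovelace"));
            table.put(put);

            // Read it back by key (a get): random access, no scan needed.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```
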
Project:

This is a two-week project in which students work on a real project, with each student contributing their role to a production-ready big data client use case. It walks students through the use case from the beginning: planning the infrastructure and environment, then deploying that environment.

Key Concepts

  • Interaction with a shared development code repository (Git)
  • Big data development environment (Eclipse-based tooling)
  • Provisioning the servers
  • Installing and configuring the Hadoop servers
  • Configuring the clusters
  • Developing the MapReduce jobs to analyze the data
  • Manual testing of the deployed code
  • Writing test scripts to automate the testing
  • Developing a dashboard to visualize the data generated by the MapReduce jobs
  • Deploying the code on the big data clusters