Hadoop Development Training Syllabus

Curriculum Designed by Experts

Hadoop Introduction

  • Move computation, not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
    • RDBMS
    • Grid Computing
    • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases

MapReduce

  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop's API changes
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners

Distributing Data with HDFS

  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
  • Hadoop Archives
    • Using Hadoop Archives
    • Limitations

Understanding Hadoop I/O

  • Data Integrity
    • Data Integrity in HDFS
    • LocalFileSystem
    • ChecksumFileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable
    • Serialization Frameworks
    • Avro
  • File-Based Data Structures
    • SequenceFile
    • MapFile

Advanced MapReduce

  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter
    • Bloom filter in Hadoop version 0.20+
  • Writing MapReduce Applications
    • The Configuration API
    • Configuring the Development Environment
    • Running Locally on Test Data
    • Cluster Specs
    • Cluster Setup and Installation
    • Hadoop Configuration
    • YARN Configuration
    • Benchmarking a Hadoop Cluster
    • Hadoop in the Cloud
    • Tuning
    • MapReduce Workflows
    • Monitoring and debugging on a production cluster
    • Tuning for performance

MapReduce Internals

  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking the system's health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • MapReduce Features
    • Counters
    • Sorting
    • Joins
    • Side Data Distribution
    • MapReduce Library

MapReduce Ecosystem

  • Pig
  • Thinking like a Pig
    • Data flow language
    • Data types
    • User-defined functions
  • Installing Pig
  • Running Pig
    • Managing the Grunt shell
    • Learning Pig Latin through Grunt
  • Speaking Pig Latin
    • Data types and schemas
    • Expressions and functions
    • Relational operators
    • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive summary
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs. RDBMS
