Hadoop Internals

Have Queries? Ask us +91 72592 22234

Course Overview

In the Hadoop Internals training course, you will gain a comprehensive understanding of the steps necessary to operate and maintain a Hadoop cluster. Covering topics from installation and configuration through load balancing and tuning, the course prepares you for the real-world challenges faced by Hadoop administrators. It also covers concepts addressed on the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.

Course Content

Hadoop Introduction

  • Move computation not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
  • Grid Computing
  • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases


  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop's API changes
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners
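
To give a feel for the map, combine, and reduce steps listed above, here is a minimal sketch of a word count simulated in plain Python. It does not use the Hadoop API; the function names (map_words, combine, reduce_counts, run_job) and the sample input are hypothetical and chosen only for illustration. The combiner pre-sums counts per input split, which is the idea behind reducing the data moved during the shuffle.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-sum counts for one split before the shuffle."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts for each key in key-sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(splits):
    """Simulate one job: map and combine per split, then sort and reduce."""
    shuffled = []
    for split in splits:
        mapped = (pair for line in split for pair in map_words(line))
        shuffled.extend(combine(mapped))
    # The sort stands in for Hadoop's shuffle, which groups pairs by key.
    shuffled.sort(key=itemgetter(0))
    return dict(reduce_counts(shuffled))


if __name__ == "__main__":
    # Two hypothetical input splits, one line each.
    splits = [["to be or not to be"], ["to do is to be"]]
    print(run_job(splits))
```

With Hadoop Streaming, the map and reduce functions would instead be separate scripts reading lines from standard input and writing tab-separated key/value pairs to standard output.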

Distributing Data with HDFS

  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
  • Hadoop Archives
    • Using Hadoop Archives
    • Limitations

Understanding Hadoop I/O

  • Data Integrity
    • Data Integrity in HDFS
    • LocalFileSystem
    • ChecksumFileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable
    • Serialization Frameworks
    • Avro
  • File-Based Data Structures
    • SequenceFile
    • MapFile

Advanced MapReduce

  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter
    • Bloom filter in Hadoop version 0.20+
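
As a rough illustration of what a Bloom filter does, the sketch below implements one in plain Python. It is not Hadoop's own Bloom filter class; the class name, the default sizes, and the use of SHA-256 with a seed byte as the hash family are assumptions made for this example. A Bloom filter can report false positives but never false negatives, which is what makes it useful for map-side filtering before a join.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions by hashing the item with a seed byte."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all derived bits are set (may be a false positive)."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def union(self, other):
        """Merge a filter built with the same parameters using bitwise OR."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


if __name__ == "__main__":
    left, right = BloomFilter(), BloomFilter()
    left.add("user-1")
    right.add("user-2")
    both = left.union(right)
    print("user-1" in both, "user-2" in both)
```

The union method mirrors how per-split filters could be built independently and then merged into a single filter, since OR-ing two bit arrays gives the filter for the combined set.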

Writing Map-Reduce Applications

  • The Configuration API
  • Configuring the Development Environment
  • Running Locally on Test Data
  • Cluster Specs
  • Cluster Setup and Installation
  • Hadoop Configuration
  • YARN Configuration
  • Benchmarking a Hadoop Cluster
  • Hadoop in the Cloud
  • Tuning
  • MapReduce Workflows
  • Monitoring and debugging on a production cluster
  • Tuning for performance

Map-Reduce Internals

  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking the system's health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • Map-Reduce Features
    • Counters
    • Sorting
    • Joins
    • Side Data Distribution
    • Map-Reduce Library

Map-Reduce Ecosystem

  • Pig
    • Thinking like a Pig
      • Data flow language
      • Data types
      • User-defined functions
    • Installing Pig
    • Running Pig
      • Managing the Grunt shell
      • Learning Pig Latin through Grunt
    • Speaking Pig Latin
      • Data types and schemas
      • Expressions and functions
      • Relational operators
      • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive summary
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs. RDBMS

Customer Reviews

Thanks to Xpertised and the tutor, who walked me through all the topics with practical exposure, which is helping me in my current project.

The course was quite helpful in terms of understanding the concepts and their practical use. It's really a very friendly environment to learn in. The timings were mutually chosen, as we are both working professionals. I am quite satisfied with the course.

