Hadoop Internals

Have Queries? Ask us +91 72592 22234

Course Overview


In the Hadoop Internals training course, you will gain a comprehensive understanding of the steps necessary to operate and maintain a Hadoop cluster. Covering topics from installation and configuration through load balancing and tuning, the course prepares you for the real-world challenges faced by Hadoop administrators. It also covers concepts addressed on the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.

Course Content


Hadoop Introduction

  • Move computation not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
    • RDBMS
    • Grid Computing
    • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases

MapReduce

  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop's API changes
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners

Distributing Data with HDFS

  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
  • Hadoop Archives
    • Using Hadoop Archives
    • Limitations

Understanding Hadoop I/O

  • Data Integrity
    • Data Integrity in HDFS
    • LocalFileSystem
    • ChecksumFileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable
    • Serialization Frameworks
    • Avro
  • File-Based Data Structures
    • SequenceFile
    • MapFile

Advanced MapReduce

  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter
    • Bloom filter in Hadoop version 0.20+

Writing Map-Reduce Applications

  • The Configuration API
  • Configuring the Development Environment
  • Running Locally on Test Data
  • Cluster Specs
  • Cluster Setup and Installation
  • Hadoop Configuration
  • YARN Configuration
  • Benchmarking a Hadoop Cluster
  • Hadoop in the Cloud
  • Tuning
  • MapReduce Workflows
  • Monitoring and debugging on a production cluster
  • Tuning for performance

Map-Reduce Internals

  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking the system's health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • Map-Reduce Features
    • Counters
    • Sorting
    • Joins
    • Side Data Distribution
    • Map-Reduce Library

Map-Reduce Ecosystem

  • Pig
    • Thinking like a Pig
      • Data flow language
      • Data types
      • User-defined functions
    • Installing Pig
    • Running Pig
      • Managing the Grunt shell
      • Learning Pig Latin through Grunt
    • Speaking Pig Latin
      • Data types and schemas
      • Expressions and functions
      • Relational operators
      • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive summary
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs RDBMS

Customer Reviews


Thanks to Xpertised and the tutor, who walked me through all the topics with practical exposure, which is helping me in my current project.
-Waseem

The course was quite helpful for understanding both the concepts and their practical use. It's really a very friendly environment to learn in. The timings were mutually chosen, as we are both working professionals. I am quite satisfied with the course.
-Tanmoy
