Hadoop Development


Course Overview


In the Hadoop Internals course, you will gain a comprehensive understanding of the steps necessary to operate and maintain a Hadoop cluster. Covering topics from installation and configuration through load balancing and tuning, the course prepares you for the real-world challenges faced by Hadoop administrators. It also covers concepts addressed on the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.

Course Content


Hadoop Introduction

  • Move computation not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
  • RDBMS
  • Grid Computing
  • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases

MapReduce

  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop's API changes
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners

Distributing Data with HDFS

  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
    • Hadoop Archives
  • Using Hadoop Archives
    • Limitations

Understanding Hadoop I/O

  • Data Integrity
    • Data Integrity in HDFS
    • LocalFileSystem
    • ChecksumFileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable
    • Serialization Frameworks
    • Avro
  • File-Based Data Structures
    • SequenceFile
    • MapFile

Advanced MapReduce

  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter
    • Bloom filter in Hadoop version 0.20+
  • Writing Map-Reduce Applications
    • The Configuration API
    • Configuring the Development Environment
    • Running Locally on Test Data
    • Cluster Specs
    • Cluster Setup and Installation
    • Hadoop Configuration
    • YARN Configuration
    • Benchmarking a Hadoop Cluster
    • Hadoop in the Cloud
    • Tuning
    • MapReduce Workflows
    • Monitoring and debugging on a production cluster
    • Tuning for performance

Map-Reduce Internals

  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking system's health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • Map-Reduce Features
    • Counters
    • Sorting
    • Joins
    • Side Data Distribution
    • Map-Reduce Library

Map-Reduce Ecosystem

  • Pig
  • Thinking like a Pig
    • Data flow language
    • Data types
    • User-defined functions
  • Installing Pig
  • Running Pig
    • Managing the Grunt shell
    • Learning Pig Latin through Grunt
  • Speaking Pig Latin
    • Data types and schemas
    • Expressions and functions
    • Relational operators
    • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive Sum-up
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs RDBMS

Customer Reviews


Thanks to Xpertised and the tutor who walked me through all the topics with practical exposure, which is helping me in my current project.
-Waseem

The course was quite helpful in terms of understanding the concepts and their practical use. It's really a very friendly environment to learn in. The timings were mutually chosen, as we are both working professionals. I am quite satisfied with the course.
-Tanmoy
