Course Content
Hadoop Introduction
- Move computation not data
- Hadoop performance and data scale facts
- Hadoop in the context of other data stores
- The Apache Hadoop Project
- Hadoop - an inside view: MapReduce and HDFS
- The Hadoop Ecosystem
- What about NoSQL?
- Comparison with Other Systems
- RDBMS
- Grid Computing
- Volunteer Computing
- A Brief History of Hadoop
- Apache Hadoop and the Hadoop Ecosystem
- Hadoop Releases
MapReduce
- Analyzing the Data with Hadoop
- Map and Reduce
- Java MapReduce
- Scaling Out
- Data Flow
- Combiner Functions
- Running a Distributed MapReduce Job
- Hadoop Streaming
- Ruby
- Python
- Hadoop Pipes
- Constructing the basic template of a MapReduce program
- Counting things
- Adapting for Hadoop's API changes
- Streaming in Hadoop
- Streaming with Unix commands
- Streaming with scripts
- Streaming with key/value pairs
- Streaming with the Aggregate package
- Improving performance with combiners
Distributing Data with HDFS
- The Design of HDFS
- HDFS Concepts
- Blocks
- Namenodes and Datanodes
- HDFS Federation
- HDFS High-Availability
- The Command-Line Interface
- Basic Filesystem Operations
- Hadoop Filesystems
- Interfaces
- The Java Interface
- Reading Data from a Hadoop URL
- Reading Data Using the FileSystem API
- Writing Data
- Directories
- Querying the Filesystem
- Deleting Data
- Data Flow
- Anatomy of a File Read
- Anatomy of a File Write
- Coherency Model
- Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
- Using Hadoop Archives
- Limitations
Understanding Hadoop I/O
- Data Integrity
- Data Integrity in HDFS
- LocalFileSystem
- ChecksumFileSystem
- Compression
- Codecs
- Compression and Input Splits
- Using Compression in MapReduce
- Serialization
- The Writable Interface
- Writable Classes
- Implementing a Custom Writable
- Serialization Frameworks
- Avro
- File-Based Data Structures
- SequenceFile
- MapFile
Advanced MapReduce
- Chaining MapReduce jobs
- Chaining MapReduce jobs in a sequence
- Chaining MapReduce jobs with complex dependency
- Chaining preprocessing and postprocessing steps
- Joining data from different sources
- Reduce-side joining
- Replicated joins using DistributedCache
- Semijoin: reduce-side join with map-side filtering
- Creating a Bloom filter
- What does a Bloom filter do?
- Implementing a Bloom filter
- Bloom filter in Hadoop version 0.20+
Writing MapReduce Applications
- The Configuration API
- Configuring the Development Environment
- Running Locally on Test Data
- Cluster Specs
- Cluster Setup and Installation
- Hadoop Configuration
- YARN Configuration
- Benchmarking a Hadoop Cluster
- Hadoop in the Cloud
- Tuning
- MapReduce Workflows
- Monitoring and debugging on a production cluster
- Tuning for performance
MapReduce Internals
- Anatomy of a MapReduce Job Run
- Classic MapReduce (MapReduce 1)
- YARN (MapReduce 2)
- Failures
- Failures in Classic MapReduce
- Failures in YARN
- Job Scheduling
- The Fair Scheduler
- The Capacity Scheduler
- Shuffle and Sort
- The Map Side
- The Reduce Side
- Configuration Tuning
- Task Execution
- The Task Execution Environment
- Speculative Execution
- Output Committers
- Task JVM Reuse
- Skipping Bad Records
Managing Hadoop
- Setting up parameter values for practical use
- Checking the system's health
- Setting permissions
- Managing quotas
- Enabling trash
- Removing DataNodes
- Adding DataNodes
- Managing NameNode and Secondary NameNode
- Recovering from a failed NameNode
- Designing network layout and rack awareness
- MapReduce Features
- Counters
- Sorting
- Joins
- Side Data Distribution
- MapReduce Library
Map-Reduce Ecosystem
- Pig
- Thinking like a Pig
- Data flow language
- Data types
- User-defined functions
- Installing Pig
- Running Pig
- Managing the Grunt shell
- Learning Pig Latin through Grunt
- Speaking Pig Latin
- Data types and schemas
- Expressions and functions
- Relational operators
- Execution optimization
- Hive
- Installing and configuring Hive
- Example queries
- HiveQL in detail
- Hive Sum-up
- HBase
- Introduction
- Concepts
- Clients
- HBase vs. RDBMS