Course Content
Big Data Overview:
- What is Big Data
- Why Big Data is gaining popularity
- Big Data Case Studies
- Big Data Characteristics
- Solutions for working with Big Data
Hadoop & Its components:
- What is Hadoop and what are its components.
- Hadoop architecture and the characteristics of the data it can handle and process.
- The Hadoop framework and its components, explained in detail.
- What is HDFS, and how reads and writes to the Hadoop Distributed File System work.
- How to set up a Hadoop cluster in different modes: pseudo-distributed and multi-node.
- This includes setting up a Hadoop cluster in VirtualBox/VMware or on individual machines, the network configurations that need careful attention, running the Hadoop daemons, and testing the cluster.
- What the MapReduce framework is and how it works.
- Running MapReduce jobs on a Hadoop cluster.
- Understanding replication, mirroring, and rack awareness in the context of Hadoop.
- All the above topics include demos and practice sessions so that learners gain hands-on experience with the technology.
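As a taste of the MapReduce model listed above, here is a minimal sketch of its three stages (map, shuffle, reduce) in plain Python. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions, and a real job would run these stages in parallel across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the list of counts for one word into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The word-count problem is chosen only because it is the customary first MapReduce exercise; any per-key aggregation follows the same shape.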
Hadoop Cluster Planning:
- How to plan your Hadoop cluster.
- Understanding the hardware and software requirements for planning your Hadoop cluster.
- Understanding workloads and planning the cluster to avoid failures and perform optimally.
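One planning calculation of this kind can be sketched as follows: a rough estimate of usable HDFS capacity once replication is accounted for. The default replication factor of 3 matches the HDFS default; the 25% reserve for temporary and non-HDFS data is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math


def usable_hdfs_capacity_tb(
    data_nodes: int,
    disk_tb_per_node: float,
    replication_factor: int = 3,      # HDFS default replication
    reserved_fraction: float = 0.25,  # assumed share kept for temp/non-HDFS data
) -> float:
    """Estimate how many TB of user data the cluster can hold."""
    if data_nodes <= 0 or replication_factor <= 0:
        raise ValueError("data_nodes and replication_factor must be positive")
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    raw_tb = data_nodes * disk_tb_per_node
    return raw_tb * (1 - reserved_fraction) / replication_factor


def data_nodes_needed(
    dataset_tb: float,
    disk_tb_per_node: float,
    replication_factor: int = 3,
    reserved_fraction: float = 0.25,
) -> int:
    """Smallest number of data nodes whose usable capacity covers the dataset."""
    per_node = usable_hdfs_capacity_tb(
        1, disk_tb_per_node, replication_factor, reserved_fraction
    )
    return math.ceil(dataset_tb / per_node)


if __name__ == "__main__":
    print(usable_hdfs_capacity_tb(10, 12.0))
    print(data_nodes_needed(100, 12.0))
```

This covers storage only; real planning also weighs CPU, memory, network, and expected data growth.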
Working with a Hadoop cluster: Hadoop Administration
- Understanding the functions of the JobTracker: resource management and job scheduling.
- Understanding schedulers: Fair | FIFO | Capacity
- Hadoop administration commands for working on Hadoop clusters: Balancer | Job list, status, metadata & data storage at specific locations | replication | Hadoop client | commissioning and decommissioning of DataNodes, and many more.
- Setting priority | Save namespace | Metasave | DFSadmin commands | FS commands | distcp | fsck | setting space quota | read/write access to HDFS | securing a Hadoop cluster, and many more.
- Backup and recovery
- Analyzing and resolving problems, with examples from live, real-world environments:
- Hadoop daemons not starting up | namespace IDs out of sync | connectivity issues between slave and master nodes | data being under-replicated | browsing through the respective UIs | job failures, etc.
Hadoop cluster with latest features:
- Hadoop 1.x and 2.x differences
- Hadoop 2.x new features
- What are YARN, Federation, and High Availability?
- Hadoop daemons and what has changed.
Working on Hadoop 2.x cluster:
- Upgrading older Hadoop versions (0.22.x or 1.x.x) to Hadoop 2.x in different modes
- Setting up Hadoop 2.x clusters in different modes and verifying the setup.
- Running MapReduce jobs on Hadoop YARN.
- Revisiting Hadoop configuration files: deprecated parameters, additions to existing config files, and miscellaneous settings.
- What Cloudera Manager is and how it is used.
- Comparing Hadoop Distributions
Hive
Introduction to Hive
- What Is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- What is Pig
- Hive vs. Pig
- Hive Use Cases
- Interacting with Hive
Relational Data Analysis with Hive
- Hive Data Formats
- Basic HiveQL Syntax
- Data Types
- Joining Data Sets
- Functions
- Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
Hive Data Management
- Hive Databases and Tables
- Creating Databases and Hive-Managed Tables
- Loading Data into Hive
- Altering Databases and Tables
- Self-Managed Tables
- Simplifying Queries with Views
- Controlling Access to Data
Text Processing with Hive
- Overview of Text Processing
- Important String Functions
- Using Regular Expressions in Hive
- Sentiment Analysis and N-Grams
- Gaining Insight with Sentiment Analysis
Hive Optimization
- Understanding Query Performance
- Controlling Job Execution Plan
- Partitioning
- Bucketing
- Indexing Data
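To make the partitioning and bucketing topics above concrete, here is a minimal sketch of the idea behind each. Hive stores each partition in a `column=value` subdirectory and assigns rows to buckets by hashing the bucketing column modulo the bucket count. The CRC32 hash below is a stand-in chosen for determinism, not Hive's actual hash function, and the function names and warehouse path are hypothetical.

```python
import zlib


def partition_dir(table_dir: str, column: str, value: str) -> str:
    """Build the column=value subdirectory that holds one partition."""
    return f"{table_dir}/{column}={value}"


def bucket_for(key: str, num_buckets: int) -> int:
    """Pick a bucket as hash(key) mod num_buckets.

    CRC32 is an assumption used for illustration; Hive applies its own
    hash function to the bucketing column.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    print(partition_dir("/warehouse/sales", "country", "US"))
    print(bucket_for("user42", 4))
```

Partitioning lets a query skip whole directories when it filters on the partition column, while bucketing spreads rows evenly into a fixed number of files, which helps with sampling and joins.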
Extending Hive
- SerDes
- Data Transformation with Custom Scripts
- Parameterized Queries
- Data Transformation with Hive
HBase
Introduction to HBase & its architecture
- subtopics
Understanding HBase internals
- subtopics
HBase Cluster planning and installing
- subtopics
HBase schema & row design, I/O considerations, and advanced configurations.
- subtopics
The HBase Administration Client API basics and advanced features.
- subtopics
Working with Data.
- subtopics
HBase cluster monitoring using frameworks & tools
- subtopics
HBase cluster administration: operational scenarios and tasks.
- subtopics
Case studies in detail, with standard practices.
Impala
Introduction to Impala
- What is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
Analyzing Data with Impala
- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
Choosing the Best Tool for the Job
- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
- Which to Choose?