Description
In this course, you will:
- Learn Big Data fundamentals such as YARN (Yet Another Resource Negotiator), MapReduce, HDFS (Hadoop Distributed File System), and Spark.
- Get plenty of opportunities to get your hands dirty with working Hadoop clusters throughout the course.
- Begin by learning about the rise of Big Data and the main types of data: structured, unstructured, and semi-structured.
Syllabus:
1. Hadoop
- Introduction
- Rise of Big Data
- Types of Data
- Big Data Defined
- Big Data vs Data Warehouse
2. YARN
- Introduction
- Workflow
- Scheduling
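To complement the workflow and scheduling topics in this section, here is a minimal sketch (not part of the course material) that inspects a cluster with the YarnClient Java API; it assumes a yarn-site.xml on the classpath pointing at a running ResourceManager.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml on the classpath.
        YarnClient client = YarnClient.createYarnClient();
        client.init(new Configuration());
        client.start();

        // List the applications the ResourceManager currently knows about.
        for (ApplicationReport app : client.getApplications()) {
            System.out.printf("%s %s %s%n",
                app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        // Inspect the scheduler's queues (e.g. a CapacityScheduler queue hierarchy).
        for (QueueInfo queue : client.getAllQueues()) {
            System.out.printf("queue=%s capacity=%.2f%n",
                queue.getQueueName(), queue.getCapacity());
        }

        client.stop();
    }
}
```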
3. MapReduce
- Basics
- Mapper
- Testing Mapper
- Mapper Input
- Reducer
- Testing Reducer
- Testing MapReduce Program
- Running MapReduce End to End
- Exploring MapReduce Runs
- Combiner and Partitioner
- Putting it Together
- Resiliency
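As a reference point for the Mapper, Reducer, and Combiner lessons in this section, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class name and the command-line input/output paths are illustrative, not taken from the course.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also usable as a combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner runs map-side to cut shuffle volume
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```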
4. HDFS
- Filesystem
- The Big Picture
- Disk Blocks & HDFS Blocks
- Block Replication
- NameNode
- DataNode
- Writing and Reading
- High Availability
- HDFS in Practice
- DistCp
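A short sketch of writing and then reading a file through the HDFS FileSystem API, illustrating the NameNode/DataNode interaction covered in this section; it assumes a Hadoop Configuration that resolves to your cluster, and the path /tmp/hello.txt is made up. DistCp itself is driven from the command line (hadoop distcp <src> <dst>).

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");     // illustrative path

        // Write: the client streams blocks to a pipeline of DataNodes; the NameNode only tracks metadata.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, then the client reads from DataNodes directly.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```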
5. Spark
- Introduction
- Architecture
- Spark Application Life Cycle
- Spark API
- Resilient Distributed Datasets (RDDs)
- DataFrames
- Datasets
- An Example
- Running Spark Applications
- Anatomy of a Spark Application
- Execution of a Spark Application
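To ground the RDD and DataFrame topics above, here is a small sketch using Spark's Java API in local mode; the input files (data/input.txt, data/people.csv) and the country column are assumptions made for illustration.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkQuickLook {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("spark-quick-look")
            .master("local[*]")              // local mode, handy for experimentation
            .getOrCreate();

        // RDD API: low-level, functional transformations on distributed collections.
        JavaRDD<String> lines = spark.read().textFile("data/input.txt").javaRDD();
        long wordCount = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .filter(w -> !w.isEmpty())
            .count();
        System.out.println("words: " + wordCount);

        // DataFrame API: schema-aware, optimized by Catalyst.
        Dataset<Row> people = spark.read().option("header", "true").csv("data/people.csv");
        people.groupBy("country").count().show();

        spark.stop();
    }
}
```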
6. Input & Output Formats
- Sequence File: Intro
- Sequence File: Reading & Writing
- SerDe
- Row vs Columnar Databases
- Avro: Intro
- Avro: Code Generation
- Avro: IDL & RPC
- Parquet: Intro
- Parquet: Definition Level
- Parquet: Repetition Level
- Parquet: Reading & Writing
- Parquet: Projection Schema & Misc. Tools
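As a concrete taste of the serialization formats in this section, here is a hedged sketch of writing and reading Avro GenericRecords; the User schema and the users.avro file name are invented for illustration. The same round-trip idea carries over to SequenceFiles and Parquet with their respective writer and reader classes.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    // Inline schema for illustration; in practice it usually lives in a .avsc file.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("users.avro");   // illustrative file name

        // Write: Avro stores the schema in the file header, followed by binary-encoded records.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read: the reader picks the schema up from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}
```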
7. Misc
- ZooKeeper: Intro
- ZooKeeper: Example
- ZooKeeper: Practical
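A brief sketch of the plain ZooKeeper Java client to accompany this section; the localhost:2181 ensemble address and the /demo-config znode are assumptions, not course material.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperDemo {
    public static void main(String[] args) throws Exception {
        // Block until the session with the ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a persistent znode if it does not exist, then read its data back.
        String path = "/demo-config";   // illustrative znode path
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```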