Description
In this course, you will learn:
- How to combine Apache Hadoop and Apache Spark to build scalable, optimized data analytics pipelines.

Instructor Kumaran Ponnambalam explores ways to optimize data modeling and storage on HDFS, discusses scalable data ingestion and extraction with Spark, and offers tips for optimizing data processing in Spark.
Syllabus:
- Introduction
1. Introduction and Setup
- Apache Hadoop overview
- Apache Spark overview
- Integrating Hadoop and Spark
- Setting up the environment
- Using exercise files
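To give a flavor of the setup covered in this chapter, here is a minimal PySpark sketch of starting a Spark session that can talk to HDFS. The application name, master URL, and namenode address are placeholders for illustration, not values from the course, and the course itself may use a different language or cluster configuration.

```python
# Minimal sketch: a Spark session configured to reach HDFS.
# The app name, master, and namenode host below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-spark-analytics")   # hypothetical application name
    .master("local[*]")                  # or a YARN / standalone master in a real cluster
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")  # placeholder namenode
    .getOrCreate()
)

print(spark.version)  # quick sanity check that the session is up
```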
2. HDFS Data Modeling for Analytics
- Storage formats
- Compression
- Partitioning
- Bucketing
- Best practices for data storage
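To make the data-modeling topics in this chapter concrete, the PySpark sketch below writes a small DataFrame to HDFS as snappy-compressed Parquet, partitioned by one column. The HDFS path, column names, and sample rows are illustrative assumptions, not the course's exercise data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-data-modeling").getOrCreate()

# Illustrative data; the course's exercise files will differ.
df = spark.createDataFrame(
    [("s1", "math", 85), ("s2", "math", 72), ("s1", "physics", 90)],
    ["student_id", "subject", "score"],
)

# Columnar storage format plus compression: snappy-compressed Parquet.
# partitionBy lays the data out as one HDFS directory per subject value,
# so later reads can skip irrelevant partitions.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .partitionBy("subject")
   .parquet("hdfs://namenode:9000/warehouse/scores"))  # placeholder path
```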
3. Data Ingestion with Spark
- Reading external files into Spark
- Writing to HDFS
- Parallel writes with partitioning
- Parallel writes with bucketing
- Best practices for ingestion
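A hedged sketch of the ingestion flow in this chapter: read an external CSV into Spark, then write it to HDFS in parallel, once with partitioning and once with bucketing. The file paths, bucket count, and column names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-ingestion").getOrCreate()

# Read an external file into Spark (path and options are illustrative).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("file:///data/incoming/scores.csv"))

# Parallel write with partitioning: each subject value becomes its own
# HDFS directory, and tasks write their partitions concurrently.
(raw.write
    .mode("overwrite")
    .partitionBy("subject")
    .parquet("hdfs://namenode:9000/raw/scores"))   # placeholder HDFS path

# Parallel write with bucketing: rows are hash-distributed by student_id
# into a fixed number of buckets; bucketing requires saveAsTable.
(raw.write
    .mode("overwrite")
    .bucketBy(8, "student_id")
    .sortBy("student_id")
    .saveAsTable("scores_bucketed"))
```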
4. Data Extraction with Spark
- How Spark works
- Reading HDFS files with schema
- Reading partitioned data
- Reading bucketed data
- Best practices for data extraction
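To illustrate the extraction topics above, the sketch below reads Parquet data back from HDFS with an explicit schema, prunes partitions by filtering on the partition column, and reads a bucketed table. Paths, column names, and the table name are placeholders carried over from the earlier sketches.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("spark-extraction").getOrCreate()

# Read HDFS files with an explicit schema instead of paying for inference.
schema = StructType([
    StructField("student_id", StringType()),
    StructField("score", IntegerType()),
    StructField("subject", StringType()),   # partition column
])
scores = spark.read.schema(schema).parquet("hdfs://namenode:9000/raw/scores")

# Reading partitioned data: filtering on the partition column lets Spark
# skip entire directories (partition pruning).
math_scores = scores.filter(scores.subject == "math")

# Reading bucketed data: query the bucketed table; aggregations and joins
# on student_id can then avoid a full shuffle.
bucketed = spark.table("scores_bucketed")
```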
5. Optimizing Spark Processing
- Pushing down projections
- Pushing down filters
- Managing partitions
- Managing shuffling
- Improving joins
- Storing intermediate results
- Best practices for data processing
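A minimal sketch of the optimization techniques listed in this chapter, using hypothetical score and student DataFrames: select only the needed columns (projection pushdown), filter early (filter pushdown), manage partition counts and shuffle parallelism, broadcast the small side of a join, and cache an intermediate result.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("spark-optimization").getOrCreate()

scores = spark.read.parquet("hdfs://namenode:9000/raw/scores")      # placeholder
students = spark.read.parquet("hdfs://namenode:9000/raw/students")  # placeholder

# Projection and filter pushdown: request only the columns and rows needed,
# so the Parquet reader skips data at the source.
trimmed = (scores
           .select("student_id", "subject", "score")
           .filter(col("score") >= 50))

# Managing partitions and shuffling: reduce partition count before wide
# operations and cap shuffle parallelism for this small example.
spark.conf.set("spark.sql.shuffle.partitions", "8")
trimmed = trimmed.coalesce(8)

# Improving joins: broadcast the small dimension table to avoid shuffling it.
joined = trimmed.join(broadcast(students), "student_id")

# Storing intermediate results: cache a DataFrame reused by several queries.
joined.cache()
joined.count()   # materialize the cache
```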
6. Use Case Project
- Problem definition
- Data loading
- Total score analytics
- Average score analytics
- Top student analytics
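As a rough idea of what the use-case analytics might look like (the actual project data and requirements are defined in the course), here is a sketch that computes total score per student, average score per subject, and the top student per subject from a hypothetical scores DataFrame.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_, avg, row_number, desc
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("use-case-project").getOrCreate()

scores = spark.read.parquet("hdfs://namenode:9000/raw/scores")  # placeholder

# Total score per student.
total_scores = scores.groupBy("student_id").agg(sum_("score").alias("total_score"))

# Average score per subject.
avg_scores = scores.groupBy("subject").agg(avg("score").alias("avg_score"))

# Top student per subject, ranked by score.
w = Window.partitionBy("subject").orderBy(desc("score"))
top_students = (scores
                .withColumn("rank", row_number().over(w))
                .filter("rank = 1"))

total_scores.show()
avg_scores.show()
top_students.show()
```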