Description
In this course, you will:
- Learn the fundamentals of Apache Spark and an overview of its components.
- Learn advanced transformations and how to use Spark SQL, Spark's library for structured data processing.
- Get hands-on experience with examples, coding, and recipes.
- Build a big data batch application with Spark, grounded in both design patterns and good programming practices.
Syllabus:
1. Spark Introduction and Basics
- Spark Fundamentals
- Components and Architecture
- Spark and Big Data
- Spark's Main Abstraction in Java: The DataFrame
2. Getting Started with Spark
- Running the First Spark Program
- Spark Maven Based Projects
- Enriching the Basic DataFrame Program
- Deep Dive: Transformations and Data Storage
3. DataFrame Basic Operations
- Working with DataFrame Schemas
- Dataset: a DataFrame of POJOs
- Transformations and Actions
- Transformations (I): Map and Filter
- Actions (I): Count, Take, and Collect
- Deep Dive: Internals of Spark Execution
- Transformations (II): FlatMap and Distinct
- Actions (II): Reduce and the Aggregate Functions Max, Min, and Mean
4. DataFrame Advanced Operations
- Data Partitioning and Shuffling
- The groupBy and groupByKey Methods
- Joins
- Sort and OrderBy
- Union, UnionByName, and DropDuplicates
- Accumulators and Broadcast Variables
- UDFs: User-defined Functions
5. Spark SQL and Other Functionalities
- Spark SQL Goodness
- Schema Manipulation
- How to Ingest Files
- Ingesting Databases
- Exporting Information
- Serialization: Working through the Wire
6. Building a Big Data Batch Application
- The Application Architecture Ecosystem
- Driver Program Design and Project Structure
- Driver Program and Job Implementation
- Ingestion Job
- Batch Pipelines and Other Types of Jobs
- Testing and Spark
7. Deployment and Cluster Execution
- Local and Cluster-based Execution
- Deploying and Running a Spark Application
8. Monitoring and Performance Fundamentals
- Interpreting Spark Logs
- Cluster Monitoring and SparkUI
- Performance Fundamentals and Recipes