Description
In this course, you will learn:
- Spark Architecture and the Apache Spark Foundation.
- Spark Data Engineering and Processing.
- Using Data Sources and Sinks.
- Using DataFrames and Spark SQL.
- Using the PyCharm IDE for Spark development and debugging.
- Unit testing, managing application logs, and deploying to a cluster.
Syllabus:
1. Understanding Big Data and Data Lake
- What is Big Data and How it Started
- Hadoop Architecture, History, and Evolution
- Introducing Apache Spark and Databricks Cloud
2. Installing and Using Apache Spark
- Spark Development Environments
- Set Up your Databricks Community Cloud Environment
- Introduction to Databricks Workspace
- Create your First Spark Application in Databricks Cloud
- Set Up your Local Development IDE
- Mac Users - Set Up your Local Development IDE
- Create your First Spark Application using IDE
3. Spark Execution Model and Architecture
- Execution Methods - How to Run Spark Programs?
- Spark Distributed Processing Model - How your program runs?
- Spark Execution Modes and Cluster Managers
- Summarizing Spark Execution Models - When to use What?
- Working with PySpark Shell - Demo
- Installing Multi-Node Spark Cluster - Demo
- Working with Notebooks in Cluster - Demo
- Working with Spark Submit - Demo
4. Spark Programming Model and Developer Experience
- Creating Spark Project Build Configuration
- Configuring Spark Project Application Logs
- Check your knowledge
- Creating Spark Session
- Check your knowledge
- Configuring Spark Session
- DataFrame Introduction
- DataFrame Partitions and Executors
- Spark Transformations and Actions
- Spark Jobs, Stages, and Tasks
- Understanding your Execution Plan
- Unit Testing Spark Application
- Rounding off Summary
5. Spark Structured API Foundation
- Introduction to Spark APIs
- Introduction to Spark RDD API
- Working with Spark SQL
- Spark SQL Engine and Catalyst Optimizer
6. Spark Data Sources and Sinks
- Spark Data Sources and Sinks
- Spark DataFrameReader API
- Reading CSV, JSON and Parquet files
- Creating Spark DataFrame Schema
- Spark DataFrameWriter API
- Writing Your Data and Managing Layout
- Spark Databases and Tables
- Working with Spark SQL Tables
7. Spark DataFrame and Dataset Transformations
- Introduction to Data Transformation
- Working with DataFrame Rows
- DataFrame Rows and Unit Testing
- DataFrame Rows and Unstructured Data
- Working with DataFrame Columns
- Creating and Using UDF
- Misc Transformations
8. Aggregations in Apache Spark
- Aggregating DataFrames
- Grouping Aggregations
- Windowing Aggregations
9. Spark DataFrame Joins
- DataFrame Joins and Column Name Ambiguity
- Outer Joins in DataFrame
- Internals of Spark Join and Shuffle
- Optimizing your Joins
- Implementing Bucket Joins
10. Archived - Apache Spark Introduction
- Big Data History and Primer
- Understanding the Data Lake Landscape
- What is Apache Spark - An Introduction and Overview
11. Archived - Installing and Using Apache Spark
- Spark Development Environments
- Mac Users - Apache Spark in Local Mode Command Line REPL
- Windows Users - Apache Spark in Local Mode Command Line REPL
- Mac Users - Apache Spark in the IDE - PyCharm
- Windows Users - Apache Spark in the IDE - PyCharm
- Apache Spark in Cloud - Databricks Community and Notebooks