Description
In this course, you will:
- Understand the Apache Spark framework, its execution model, and its programming model for developing Big Data systems.
- Learn how to set up and configure Spark in a free cloud-based environment and on a desktop machine.
- Using real-world case studies, build simple to advanced Big Data applications for data of varying volume, variety, and veracity.
- Learn how to use the RDD, DataFrame, and SQL APIs in step-by-step, hands-on PySpark exercises on structured, unstructured, and semi-structured data (see the sketch after this list).
- Investigate and implement optimization and performance-tuning methods for managing data skew and preventing spill.
- Examine and implement Adaptive Query Execution (AQE) to optimize Spark SQL query execution at runtime (a configuration sketch follows the syllabus below).
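As a taste of the hands-on work, here is a minimal sketch of the three APIs applied to a semi-structured JSON file. The file name events.json and the columns status and country are hypothetical placeholders, not materials provided by the course.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (works in a free cloud notebook or on a desktop machine).
spark = SparkSession.builder.appName("pyspark-intro").master("local[*]").getOrCreate()

# DataFrame API: read a semi-structured JSON file (path and columns are placeholders).
df = spark.read.json("events.json")
ok_by_country = df.filter(F.col("status") == "ok").groupBy("country").count()

# SQL API: query the same data through a temporary view.
df.createOrReplaceTempView("events")
top_countries = spark.sql(
    "SELECT country, COUNT(*) AS n FROM events GROUP BY country ORDER BY n DESC"
)

# RDD API: drop to the underlying RDD for row-level transformations.
distinct_statuses = df.rdd.map(lambda row: row["status"]).distinct().collect()

ok_by_country.show()
top_countries.show()
print(distinct_statuses)
```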
Syllabus:
- PySpark for a large Semi-Structured (JSON) File
- PySpark for a large Structured File
- PySpark for a large Unstructured (LOG) File
- Distributed Processing Challenges and Spark Performance Tuning
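As an illustration of the tuning topics above, the following is a minimal sketch of the Spark SQL settings involved in AQE and skew handling; the property names are standard Spark 3.x configuration keys, and the application name is a placeholder.

```python
from pyspark.sql import SparkSession

# A minimal session with Adaptive Query Execution (AQE) switched on, so Spark can
# re-optimize SQL plans at runtime using statistics gathered at shuffle boundaries.
spark = (
    SparkSession.builder
    .appName("aqe-tuning")  # placeholder application name
    .master("local[*]")
    # Core AQE switch (enabled by default in recent Spark 3.x releases).
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce many small shuffle partitions into fewer, larger ones after a shuffle.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions in sort-merge joins to reduce data skew and spill.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

With these settings, Spark can split heavily skewed join partitions automatically; severe skew may still call for manual techniques such as key salting or broadcast joins.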