Description
In this course, you will:
- Use functional-style Java to define complex data processing jobs
- Learn the differences between the RDD and DataFrame APIs
- Use an SQL-style syntax to produce reports against Big Data sets
- Use Machine Learning algorithms with Big Data and SparkML
- Connect Spark to Apache Kafka to process streams of Big Data
- See how Structured Streaming can be used to build pipelines with Kafka
Syllabus:
1. Getting Started
- Warning - Java 9+ is not supported by Spark 2. You can optionally use Spark 3.
- Installing Spark
2. Reduces on RDDs
- Reduces on RDDs
3. Mapping and Outputting
- Mapping Operations
- Outputting Results to the Console
- Counting Big Data Items
- If you've had a "NotSerializableException" in Spark
4. Tuples
- RDDs of Objects
- Tuples and RDDs
5. PairRDDs
- Overview of PairRDDs
- Building a PairRDD
- Coding a ReduceByKey
- Using the Fluent API
- Grouping By Key
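A minimal sketch of the PairRDD and reduceByKey pattern this chapter covers, assuming a local master and a made-up list of log levels as the input data:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class PairRddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pairRdd").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input data - in the course this comes from log files.
            List<String> logLevels = Arrays.asList("WARN", "ERROR", "WARN", "INFO", "WARN");

            // Build a PairRDD of (level, 1) and reduce by key using the fluent API.
            JavaPairRDD<String, Long> counts = sc.parallelize(logLevels)
                    .mapToPair(level -> new Tuple2<>(level, 1L))
                    .reduceByKey(Long::sum);

            counts.collect().forEach(t -> System.out.println(t._1 + " appears " + t._2 + " times"));
        }
    }
}
```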
6. FlatMaps and Filters
- FlatMaps
- Filters
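As a rough illustration of the flatMap and filter operations named above, reusing the JavaSparkContext sc from the earlier sketch (the input lines are invented):

```java
import org.apache.spark.api.java.JavaRDD;
import java.util.Arrays;

JavaRDD<String> sentences = sc.parallelize(Arrays.asList(
        "WARN: Tuesday 4 September 0405",
        "ERROR: Tuesday 4 September 0408"));

// flatMap: one input line in, many words out.
JavaRDD<String> words = sentences.flatMap(sentence -> Arrays.asList(sentence.split(" ")).iterator());

// filter: keep only words longer than one character.
JavaRDD<String> longWords = words.filter(word -> word.length() > 1);
```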
7. Reading from Disk
- Reading from Disk
8. Keyword Ranking Practical
- Practical Requirements
- Worked Solution
- Worked Solution (continued) with Sorting
9. Sorts and Coalesce
- Why do sorts not work with foreach in Spark?
- Why Coalesce is the Wrong Solution
- What is Coalesce used for in Spark?
10. Deploying to AWS EMR
- How to start an EMR Spark Cluster
- Packaging a Spark Jar for EMR
- Running a Spark Job on EMR
- Understanding the Job Progress Output
- Calculating EMR costs and Terminating the cluster
11. Joins
- Inner Joins
- Left Outer Joins and Optionals
- Right Outer Joins
- Full Joins and Cartesians
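A sketch of a left outer join on PairRDDs, where unmatched right-hand values come back as Spark's own Optional; the user ids and names are made up, and sc is the JavaSparkContext from the earlier sketch:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.Optional;
import scala.Tuple2;
import java.util.Arrays;

JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(4, 18), new Tuple2<>(6, 4), new Tuple2<>(10, 9)));

JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "John"), new Tuple2<>(4, "Doris"), new Tuple2<>(6, "Raquel")));

// Left outer join: every visit is kept; users with no match appear as Optional.empty().
JavaPairRDD<Integer, Tuple2<Integer, Optional<String>>> joined = visits.leftOuterJoin(users);
joined.collect().forEach(t ->
        System.out.println(t._1 + " -> " + t._2._2.orElse("unknown user")));
```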
12. Big Data Big Exercise
- Introducing the Requirements
- Warmup
- Main Exercise Requirements
- Walkthrough
- Adding Titles and Using the Big Data File
13. RDD Performance
- Transformations and Actions
- The DAG and SparkUI
- Narrow vs Wide Transformations
- Shuffles
- Dealing with Key Skews
- Avoiding groupByKey and using map-side-reduces instead
- Caching and Persistence
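One way to picture the map-side-reduce and caching ideas from this chapter, reusing the words RDD from the flatMap sketch above:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

JavaPairRDD<String, Long> pairs = words.mapToPair(w -> new Tuple2<>(w, 1L));

// Prefer reduceByKey over groupByKey().mapValues(...): each partition
// pre-aggregates before the shuffle, so far less data moves across the network.
JavaPairRDD<String, Long> counts = pairs.reduceByKey(Long::sum);

// Persist when the same RDD feeds more than one action, so it isn't recomputed.
counts.persist(StorageLevel.MEMORY_AND_DISK());
System.out.println("distinct words: " + counts.count());
counts.take(10).forEach(System.out::println);
```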
14. SparkSQL Introduction
- Code for SQL/DataFrames Section
- Introducing SparkSQL
15. SparkSQL Getting Started
- SparkSQL Getting Started
16. Datasets
- Dataset Basics
- Filters using Expressions
- Filters using Lambdas
- Filters using Columns
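The three filter styles listed above can be sketched like this, assuming a Dataset<Row> named dataset with a "level" column (the dataset and column names are illustrative):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Filter using an SQL-style expression.
Dataset<Row> warningsByExpression = dataset.filter("level = 'WARN'");

// Filter using a Java lambda.
Dataset<Row> warningsByLambda = dataset.filter(
        (FilterFunction<Row>) row -> "WARN".equals(row.getAs("level")));

// Filter using the column API.
Dataset<Row> warningsByColumn = dataset.filter(col("level").equalTo("WARN"));
```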
17. The Full SQL Syntax
- Using a Spark Temporary View for SQL
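A minimal sketch of registering a temporary view and querying it with the full SQL syntax, assuming the same SparkSession spark and Dataset<Row> dataset as above (table and column names are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

dataset.createOrReplaceTempView("logging_table");

Dataset<Row> results = spark.sql(
        "select level, count(1) as total from logging_table group by level order by total desc");
results.show();
```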
18. In Memory Data
- In Memory Data
19. Groupings and Aggregations
- Groupings and Aggregations
20. Date Formatting
- Date Formatting
21. Multiple Groupings
- Multiple Groupings
22. Ordering
- Ordering
23. DataFrames API
- SQL vs DataFrames
- DataFrame Grouping
24. Pivot Tables
- How does a Pivot Table work?
- Coding a Pivot Table in Spark
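A rough sketch of a pivot table in the DataFrame API, assuming a dataset with "subject" and "year" columns (the names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One row per subject; the distinct years become the pivoted columns.
Dataset<Row> pivot = dataset.groupBy("subject").pivot("year").count();
pivot.show();
```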
25. More Aggregations
- How to use the agg method in Spark
26. User Defined Functions
- How to use a Lambda to write a UDF in Spark
- Using more than one input parameter in Spark UDF
- Using a UDF in Spark SQL
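A sketch of registering a lambda-based UDF with two input parameters and using it from both the DataFrame API and Spark SQL; the function name, columns, and pass/fail rule are all invented for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Two input parameters: a grade and a subject. Returns whether the student passed.
spark.udf().register("hasPassed",
        (UDF2<String, String, Boolean>) (grade, subject) ->
                "Biology".equals(subject) ? grade.startsWith("A")
                                          : grade.startsWith("A") || grade.startsWith("B"),
        DataTypes.BooleanType);

// From the DataFrame API...
Dataset<Row> withPass = dataset.withColumn("pass", callUDF("hasPassed", col("grade"), col("subject")));

// ...or from Spark SQL after registering a temp view.
dataset.createOrReplaceTempView("students");
spark.sql("select grade, subject, hasPassed(grade, subject) as pass from students").show();
```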
27. SparkSQL Performance
- Understand the SparkUI for SparkSQL
- How does SQL and DataFrame performance compare?
- Update - Setting spark.sql.shuffle.partitions
28. HashAggregation
- Explaining Execution Plans
- How does HashAggregation work?
- How can I force Spark to use HashAggregation?
- SQL vs DataFrames Performance Results
29. SparkSQL Performance vs RDDs
- SparkSQL Performance vs RDDs
30. SparkML for Machine Learning
- Welcome to Module 3
- What is Machine Learning?
- Coming up in this Module - and introducing Kaggle
- Supervised vs Unsupervised Learning
- The Model Building Process
31. Linear Regression Models
- Introducing Linear Regression
- Beginning Coding Linear Regressions
- Assembling a Vector of Features
- Model Fitting
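A rough sketch of assembling a feature vector and fitting a linear regression with SparkML, assuming a Dataset<Row> named housing with numeric columns (the column names are made up):

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Combine the input columns into the single "features" vector SparkML expects.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"sqft_living", "bedrooms", "bathrooms"})
        .setOutputCol("features");

Dataset<Row> modelInput = assembler.transform(housing)
        .select("price", "features")
        .withColumnRenamed("price", "label");

// Fit the model: label = intercept + coefficients . features
LinearRegressionModel model = new LinearRegression().fit(modelInput);
System.out.println("intercept: " + model.intercept());
System.out.println("coefficients: " + model.coefficients());
```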
32. Training Data
- Training vs Test and Holdout Data
- Using data from Kaggle
- Practical Walkthrough
- Splitting Training Data with Random Splits
- Assessing Model Accuracy with R2 and RMSE
33. Model Fitting Parameters
- Setting Linear Regression Parameters
- Training, Test and Holdout Data
34. Feature Selection
- Describing the Features
- Correlation of Features
- Identifying and Eliminating Duplicated Features
- Data Preparation
35. Non-Numeric Data
- Using OneHotEncoding
- Understanding Vectors
36. Pipelines
- Pipelines
37. Logistic Regression
- Code for chapters
- True/False Negatives and Positives
- Coding a Logistic Regression
38. Decision Trees
- Overview of Decision Trees
- Building the Model
- Interpreting a Decision Tree
- Random Forests
39. K Means Clustering
- K Means Clustering
40. Recommender Systems
- Overview and Matrix Factorisation
- Building the Model
41. Spark Streaming and Structured Streaming with Kafka
- Spark Streaming
- Introduction to Streaming
- DStreams
- Starting a Streaming Job
- Streaming Transformations
- Streaming Aggregations
- SparkUI for Streaming Jobs
- Windowing Batches
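A minimal DStream sketch along the lines of this chapter, reading lines from a local socket; the host, port, and one-second batch duration are just example values:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming").setMaster("local[*]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each batch is a small RDD of whatever arrived on the socket in the last second.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 8989);
        lines.map(String::toUpperCase).print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```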
42. Streaming with Apache Kafka
- Overview of Kafka
- Installing Kafka
- Using a Kafka Event Simulator
- Integrating Kafka with Spark
- Using KafkaUtils to access a DStream
- Writing a Kafka Aggregation
- Adding a Window
- Adding a Slide Interval
43. Structured Streaming
- Structured Streaming Overview
- Data Sinks
- Structured Streaming Output Modes
- Windows and Watermarks
- What is the Batch Size in Structured Streaming?
- Kafka Structured Streaming Pipelines
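And a hedged sketch of the Kafka structured-streaming pipeline idea from the final chapter; the broker address, topic name, and output mode are illustrative, and the spark-sql-kafka connector is assumed to be on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("structuredViewingReport")
                .getOrCreate();

        // Read the Kafka topic as an unbounded table; key and value arrive as binary columns.
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "viewrecords")
                .load();

        df.createOrReplaceTempView("viewing_figures");
        Dataset<Row> results = spark.sql(
                "select cast(value as string) as course_name, count(1) as views " +
                "from viewing_figures group by cast(value as string)");

        // The console sink is a convenient data sink while developing.
        StreamingQuery query = results.writeStream()
                .format("console")
                .outputMode(OutputMode.Update())
                .start();

        query.awaitTermination();
    }
}
```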