Description
In this course, you will:
- Use DataFrames and Structured Streaming in Spark 3
- Frame big data analysis problems as Spark problems
- Use Amazon's Elastic MapReduce service to run your job on a cluster with Hadoop YARN
- Install and run Apache Spark on a desktop computer or on a cluster
- Use Spark's Resilient Distributed Datasets to process and analyze large data sets across many CPUs
- Implement iterative algorithms such as breadth-first search using Spark
- Use the MLlib machine learning library to answer common data mining questions
- Understand how Spark SQL lets you work with structured data
- Understand how Spark Streaming lets you process continuous streams of data in real time
- Tune and troubleshoot large jobs running on a cluster
- Share information between nodes on a Spark cluster using broadcast variables and accumulators
- Understand how the GraphX library helps with network analysis problems
Syllabus:
1. Spark Basics and the RDD Interface
- What's new in Spark 3?
- Introduction to Spark
- The Resilient Distributed Dataset (RDD)
- Ratings Histogram Walkthrough
- Key/Value RDDs, and the Average Friends by Age Example
- Filtering RDDs, and the Minimum Temperature by Location Example
- Check Your Sorted Implementation and Results Against Mine.
2. Spark SQL, DataFrames, and Datasets
- Introducing Spark SQL
- Executing SQL commands and SQL-style functions on a DataFrame
- Using DataFrames instead of RDDs
- Exercise Solution: Friends by Age, with DataFrames
- Exercise Solution: Total Spent by Customer, with DataFrames
3. Advanced Examples of Spark Programs
- Find the Most Popular Superhero in a Social Graph
- Exercise Solution: Most Obscure Superheroes
- Superhero Degrees of Separation: Introducing Breadth-First Search
- Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
- Item-Based Collaborative Filtering in Spark, cache(), and persist()
4. Running Spark on a Cluster
- Introducing Elastic MapReduce
- Partitioning
- Create Similar Movies from One Million Ratings
- Troubleshooting Spark on a Cluster
- More Troubleshooting, and Managing Dependencies
5. Machine Learning with Spark ML
- Introducing MLlib
- Analyzing the ALS Recommendations Results
6. Spark Streaming, Structured Streaming, and GraphX
- Spark Streaming
- Exercise Solution: Using Structured Streaming with Windows
- GraphX