Description
In this course, you will learn how to:
- Develop distributed code using the Scala programming language.
- Transform structured data with SparkSQL, Datasets, and DataFrames.
- Frame big data analysis problems as Apache Spark scripts.
- Optimise Spark jobs through partitioning, caching, and other techniques.
- Build, deploy, and run Spark scripts on Hadoop clusters.
- Process continuous streams of data with Spark Streaming.
- Traverse and analyse graph structures with GraphX.
- Analyse large amounts of data with machine learning on Spark.
Syllabus:
1. Scala Crash Course
- [Activity] Scala Basics
- [Exercise] Flow Control in Scala
- [Exercise] Functions in Scala
- [Exercise] Data Structures in Scala
2. Using Resilient Distributed Datasets (RDDs)
- The Resilient Distributed Dataset
- Ratings Histogram Example
- Spark Internals
- Key/Value RDDs, and the Average Friends by Age Example
- [Activity] Running the Average Friends by Age Example
- Filtering RDDs, and the Minimum Temperature by Location Example
- [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum
- [Activity] Counting Word Occurrences using flatMap()
- [Activity] Improving the Word Count Script with Regular Expressions
- [Activity] Sorting the Word Count Results
- [Exercise] Find the Total Amount Spent by Customer
- [Exercise] Check your Results, and Sort Them by Total Amount Spent
- Check Your Results and Implementation Against Mine
3. SparkSQL, DataFrames, and Datasets
- [Activity] Using SparkSQL
- [Activity] Using Datasets
- [Exercise] Implement the "Friends by Age" example using Datasets
- Exercise Solution: Friends by Age, with Datasets
- [Activity] Word Count example, using Datasets
- [Activity] Revisiting the Minimum Temperature example, with Datasets
- [Exercise] Implement the "Total Spent by Customer" problem with Datasets
4. Advanced Examples of Spark Programs
- [Activity] Find the Most Popular Movie
- [Activity] Use Broadcast Variables to Display Movie Names
- [Activity] Find the Most Popular Superhero in a Social Graph
- [Exercise] Find the Most Obscure Superheroes
- Exercise Solution: Find the Most Obscure Superheroes
- Superhero Degrees of Separation: Introducing Breadth-First Search
- Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
- [Activity] Superhero Degrees of Separation: Review the code, and run it!
- Item-Based Collaborative Filtering in Spark, cache(), and persist()
- [Activity] Running the Similar Movies Script using Spark's Cluster Manager
- [Exercise] Improve the Quality of Similar Movies
5. Running Spark on a Cluster
- [Activity] Using spark-submit to run Spark driver scripts
- [Activity] Packaging driver scripts with SBT
- [Exercise] Package a Script with SBT and Run it Locally with spark-submit
- Exercise Solution: Using SBT and spark-submit
- Introducing Amazon Elastic MapReduce
- Creating Similar Movies from One Million Ratings on EMR
- Partitioning
- Best Practices for Running on a Cluster
- Troubleshooting, and Managing Dependencies
6. Machine Learning with Spark ML
- Introducing MLlib
- [Activity] Using MLlib to Produce Movie Recommendations
- Linear Regression with MLlib
- [Activity] Running a Linear Regression with Spark
- [Exercise] Predict Real Estate Values with Decision Trees in Spark
7. Intro to Spark Streaming
- The DStream API for Spark Streaming
- [Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter
- Structured Streaming
- [Activity] Using Structured Streaming for real-time log analysis
- [Exercise] Windowed Operations with Structured Streaming
- Exercise Solution: Top URLs in a 30-second Window
8. Intro to GraphX
- GraphX, Pregel, and Breadth-First Search with Pregel
- Using the Pregel API with Spark GraphX
- [Activity] Superhero Degrees of Separation using GraphX