Description
We'll cover Spark's programming model in depth, paying close attention to how and when it differs from familiar programming models, such as shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when distribution concerns such as latency and network communication matter, and how to address them effectively for improved performance.
Learning Outcomes
By the end of this course, you will be able to:
- read data from persistent storage and load it into Apache Spark (see the short sketch after this list),
- manipulate data with Spark and Scala,
- express algorithms for data analysis in a functional style,
- recognize how to avoid shuffles and recomputation in Spark.
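As a taste of the first two outcomes, here is a minimal sketch of loading text from persistent storage and manipulating it in a functional style. It assumes a spark-shell session (where `sc`, the SparkContext, is predefined) and a hypothetical input file `input.txt`; the course covers these APIs in depth.

```scala
// Load a text file from persistent storage into an RDD.
// "input.txt" is a placeholder path; any local or HDFS path works.
val lines = sc.textFile("input.txt")

// Manipulate the data in a functional style: the classic word count.
val counts = lines
  .flatMap(_.split("\\s+"))      // split each line into words
  .map(word => (word, 1))        // pair each word with a count of 1
  .reduceByKey(_ + _)            // sum the counts per word

counts.take(10).foreach(println) // an action: triggers the computation
```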
Syllabus
1. Getting Started + Spark Basics
- Introduction, Logistics, What You'll Learn
- Data-Parallel to Distributed Data-Parallel
- Latency
- RDDs, Spark's Distributed Collection
- RDDs: Transformations and Actions
- Evaluation in Spark: Unlike Scala Collections! (sketched below)
- Cluster Topology Matters!
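A minimal sketch of the transformation/action distinction from this module, again assuming a spark-shell session with `sc` predefined: unlike Scala collections, transformations on an RDD are lazy, and only an action triggers evaluation.

```scala
val nums = sc.parallelize(1 to 1000000)

// Transformations are lazy: no distributed computation happens here.
val evens = nums.filter(_ % 2 == 0)

// Actions force evaluation of the whole lineage on the cluster.
val howMany = evens.count()   // the job runs now
```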
2. Reduction Operations & Distributed Key-Value Pairs
- Reduction Operations
- Pair RDDs
- Transformations and Actions on Pair RDDs
- Joins (illustrated in the snippet below)
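A short sketch of pair-RDD operations and joins, using made-up order/customer data and assuming a spark-shell session:

```scala
// Pair RDDs are RDDs of key-value tuples; keys unlock extra operations.
val orders    = sc.parallelize(Seq((1, "laptop"), (2, "phone"), (1, "mouse")))
val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// A reduction operation on a pair RDD: count orders per customer id.
val orderCounts = orders.mapValues(_ => 1).reduceByKey(_ + _)

// An inner join matches pairs with equal keys across the two RDDs.
val joined = orders.join(customers)  // RDD[(Int, (String, String))]
joined.collect().foreach(println)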
3. Partitioning and Shuffling
- Shuffling: What It Is and Why It's Important
- Partitioning
- Optimizing with Partitioners (see the example below)
- Wide vs Narrow Dependencies
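One way partitioners help avoid shuffles, sketched with hypothetical data in a spark-shell session: pre-partitioning a pair RDD by key and persisting it lets subsequent key-based operations run as narrow dependencies.

```scala
import org.apache.spark.HashPartitioner

val events = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Hash-partition by key and persist, so the partitioning is reused.
val partitioned = events.partitionBy(new HashPartitioner(8)).persist()

// Because Spark knows the partitioner, this reduceByKey needs no
// shuffle: all values for a key already live in the same partition.
val sums = partitioned.reduceByKey(_ + _)
sums.collect().foreach(println)
```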
4. Structured Data: SQL, DataFrames, and Datasets
- Structured vs Unstructured Data
- Spark SQL
- DataFrames
- Datasets (a short example follows)
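A final sketch contrasting the structured APIs, assuming a spark-shell session where `spark` (a SparkSession) is predefined and using a made-up Person type:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// A Dataset is a typed, structured collection; a DataFrame is Dataset[Row].
val people = Seq(Person("Alice", 29), Person("Bob", 41)).toDS()

// The same query in the Dataset/DataFrame API...
people.filter($"age" > 30).show()

// ...and in Spark SQL against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```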