Description
This practical, hands-on course helps you get comfortable with PySpark. In this course, you will learn:
- How PySpark can enhance your data science work.
- How Spark addresses many of the challenges of working with big data.
- The Spark ecosystem, and its advantages over other data science platforms, APIs, and tool sets.
- Resilient Distributed Datasets (RDDs), the building blocks of Spark.
Syllabus:
- Introduction
- Apache PySpark
- What you should know
1. Introduction to Apache Spark
- The Apache Spark ecosystem
- Why Spark?
- Spark origins and Databricks
- Spark components
- Partitions, transformations, lazy evaluation, and actions (see the sketch below)
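The chapter's core idea, that transformations are lazy and only an action triggers computation across partitions, can be illustrated with a minimal PySpark sketch. The file name and column names here are placeholders, not taken from the course:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Read a CSV file into a DataFrame (file name and columns are placeholders).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations only describe work; Spark records them lazily.
high_value = df.filter(df["amount"] > 100).select("customer", "amount")

# An action forces Spark to actually execute the accumulated plan.
print(high_value.count())

# Partitions are how the data is physically split for parallel processing.
print(high_value.rdd.getNumPartitions())
```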
2. Technical Setup
- Set up the lab environment
- Download a dataset
- Importing
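As a rough sketch of what these setup steps amount to, the following assumes a local PySpark installation and uses a placeholder dataset URL and file name; the course's actual lab environment and dataset will differ:

```python
import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("setup-demo").getOrCreate()

# Download a dataset (the URL and file name are placeholders).
urllib.request.urlretrieve("https://example.com/dataset.csv", "dataset.csv")

# Import it into a Spark DataFrame and take a first look.
df = spark.read.csv("dataset.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()
```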
3. Working with the DataFrame API
- The DataFrame API (see the sketch after this chapter's topics)
- Working with DataFrames
- Schemas
- Working with columns
- Working with rows
- Challenge
- Solution
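The sketch referenced above gives a rough idea of the DataFrame topics in this chapter: defining an explicit schema, then working with columns and rows. The data and column names are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Schemas: declare column names and types explicitly rather than inferring them.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], schema=schema)
df.printSchema()

# Working with columns: rename a column, derive a new one, and drop one.
df = (df.withColumnRenamed("name", "customer")
        .withColumn("age_next_year", col("age") + 1)
        .drop("age"))

# Working with rows: filter, sort, and display the result.
df.filter(col("age_next_year") > 30).orderBy("customer").show()
```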
4. Functions
- Built-in functions (see the sketch after this chapter's topics)
- Working with dates
- User-defined functions
- Working with joins
- Challenge
- Solution
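The sketch referenced above touches each of the chapter's topics: built-in functions, date handling, a user-defined function, and a join. The DataFrames and column names are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, to_date, year, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("functions-demo").getOrCreate()

# Two small DataFrames with invented data.
orders = spark.createDataFrame(
    [(1, "alice", "2023-01-15"), (2, "bob", "2023-02-20")],
    ["order_id", "customer", "order_date"],
)
customers = spark.createDataFrame(
    [("alice", "US"), ("bob", "UK")],
    ["customer", "country"],
)

# Built-in functions and dates: uppercase a string column, parse a date string,
# and extract the year.
orders = (orders
          .withColumn("customer", upper("customer"))
          .withColumn("order_date", to_date("order_date", "yyyy-MM-dd"))
          .withColumn("order_year", year("order_date")))

# User-defined function: wrap ordinary Python logic so Spark can apply it row by row.
greet = udf(lambda name: f"Hello, {name}!", StringType())
orders = orders.withColumn("greeting", greet("customer"))

# Join: combine the two DataFrames on the shared customer column.
result = orders.join(
    customers.withColumn("customer", upper("customer")),
    on="customer",
    how="inner",
)
result.show()
```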
5. Resilient Distributed Datasets (RDDs)
- RDDs
- Working with RDDs
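Finally, a brief sketch of the RDD basics this chapter covers, using toy data rather than the course dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python list, then apply transformations and an action.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)            # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)
print(evens.collect())                        # action: [4, 16]

# DataFrames are built on RDDs; you can drop down to the RDD level when needed.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
print(df.rdd.map(lambda row: row.id).collect())
```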