Description
In this course, you will learn:
- About the Scala features most useful to data scientists, including custom functions, parallel processing, and programming Spark with Scala.
- How to use SQL from Scala, a particularly useful skill for data scientists, who often have to extract data from relational databases.
- How to work with Resilient Distributed Datasets (RDDs)—a fundamental Spark data structure.
- How to use Scala with Spark DataFrames, a data structure designed specifically for analytic processing.
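As a rough illustration of the first item, here is a minimal sketch of a custom function mapped and filtered over a Scala collection. It is not taken from the course materials: the object name `CustomFunctionSketch`, the function `toCelsius`, and the sample readings are all hypothetical, and the example uses only the standard library (no Spark).

```scala
// Hypothetical example: names and values are illustrative, not from the course.
object CustomFunctionSketch {
  // A custom function: convert a Fahrenheit reading to Celsius.
  def toCelsius(fahrenheit: Double): Double = (fahrenheit - 32.0) * 5.0 / 9.0

  def main(args: Array[String]): Unit = {
    // An immutable Vector, one of the standard Scala collections.
    val readingsF = Vector(32.0, 50.0, 212.0)

    // Map the custom function over every element of the collection.
    val readingsC = readingsF.map(toCelsius)

    // Keep only the readings strictly above freezing.
    val aboveFreezing = readingsC.filter(_ > 0.0)

    println(readingsC)
    println(aboveFreezing)
  }
}
```

The same `map` and `filter` style carries over to parallel collections, RDDs, and DataFrames, which is one reason these collection operations tend to be introduced early.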
Syllabus:
- Introduction
- What you should know
- Using the exercise files
1. Introduction to Scala
- The advantages of Scala for data science
- Installing Scala
- Scala data types
- Scala collections
- Scala sets
- Scala arrays, vectors, and ranges
- Scala maps
- Scala expressions
- Scala functions
- Scala objects
2. Parallel Processing in Scala
- Advantages of parallel collections
- Creating parallel collections
- Mapping functions over parallel collections
- Filtering parallel collections
- When and when not to use parallel collections
3. Using SQL in Scala
- Installing PostgreSQL
- Loading data into PostgreSQL
- Connecting to PostgreSQL
- Querying with SQL strings
- Querying with prepared statements
- Summary of SQL in Scala
4. Scala and Spark RDDs
- Introduction to Spark
- Installing Spark
- Getting started with Spark RDDs
- Mapping functions over RDDs
- Statistics over RDDs
- Summary of Scala and Spark RDDs
5. Scala and Spark DataFrames
- Creating DataFrames
- Grouping and filtering on DataFrames
- Joining DataFrames
- Working with JSON files
- Summary of Scala and Spark DataFrames