Description
In this course, you will:
- Gain a hands-on understanding of Apache Spark and use it to solve machine learning problems involving both small and large amounts of data.
- Understand how to write parallel code capable of running on thousands of CPUs.
- Apply machine learning algorithms to petabytes of data using Apache SparkML Pipelines on large-scale compute clusters.
- Avoid the out-of-memory errors that traditional machine learning frameworks run into when data does not fit in a single computer's main memory.
- Test thousands of different ML models in parallel to find the best performing one, as many successful Kagglers do.
- Use Apache SparkSQL and the Apache Spark DataFrame API to run SQL statements on very large data sets.
Syllabus:
1. Introduction
- Introduction to Apache Spark for Machine Learning on Big Data
- What is Big Data?
- Data storage solutions
- Parallel data processing strategies of Apache Spark
- Functional programming basics
- Resilient Distributed Datasets (RDDs) and DataFrames - Apache SparkSQL
2. Scaling Math for Statistics on Apache Spark
- Averages
- Standard deviation
- Skewness
- Kurtosis
- Covariance, Covariance matrices, correlation
- Plotting with Apache Spark and Python's matplotlib
- Dimensionality reduction
- PCA
3. Introduction to Apache SparkML
- How ML Pipelines work
- Introduction to SparkML
- Extract, Transform, Load (ETL)
- Introduction to Clustering: k-Means
- Using k-Means in Apache SparkML
4. Supervised and Unsupervised learning with SparkML
- Linear Regression
- LinearRegression with Apache SparkML
- Logistic Regression
- LogisticRegression with Apache SparkML