Description
In this course, you will learn:
- Storing Big Data in HDFS
- Transformations and Actions in Spark
- Data Ingestion using Sqoop and Flume
- Querying Big Data using Spark SQL
- Building a Data Pipeline using Kafka
- Real-time Data Processing with Spark
Syllabus:
1. Introduction to Big Data Hadoop and Spark
- What is Big Data?
- Big Data Customer Scenarios
- Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
- How Hadoop Solves the Big Data Problem
- What is Hadoop?
- Hadoop’s Key Characteristics
- Hadoop Ecosystem and HDFS
- Hadoop Core Components
- Rack Awareness and Block Replication
- YARN and Its Advantages
- Hadoop Cluster and its Architecture
- Hadoop: Different Cluster Modes
- Big Data Analytics with Batch & Real-Time Processing
- Why Spark Is Needed
- What is Spark?
- How Spark Differs from Its Competitors
- Spark at eBay
- Spark’s Place in Hadoop Ecosystem
2. Introduction to Python for Apache Spark
- Overview of Python
- Different Applications where Python is Used
- Values, Types, Variables
- Operands and Expressions
- Conditional Statements
- Loops
- Command Line Arguments
- Writing to the Screen
- Python File I/O Functions
- Numbers
- Strings and related operations
- Tuples and related operations
- Lists and related operations
- Dictionaries and related operations
- Sets and related operations
3. Functions, OOP, and Modules in Python
- Functions
- Function Parameters
- Global Variables
- Variable Scope and Returning Values
- Lambda Functions
- Object-Oriented Concepts
- Standard Libraries
- Modules Used in Python
- The Import Statements
- Module Search Path
- Ways to Install Packages
4. Deep Dive into Apache Spark Framework
- Spark Components and Architecture
- Spark Deployment Modes
- Introduction to PySpark Shell
- Submitting a PySpark Job
- Spark Web UI
- Writing Your First PySpark Job Using Jupyter Notebook
- Data Ingestion using Sqoop
5. Playing with Spark RDDs
- Challenges in Existing Computing Methods
- Probable Solution & How RDD Solves the Problem
- What is RDD, Its Operations, Transformations & Actions
- Data Loading and Saving Through RDDs
- Key-Value Pair RDDs
- Other Pair RDDs, Two Pair RDDs
- RDD Lineage
- RDD Persistence
- WordCount Program Using RDD Concepts
- RDD Partitioning & How it Helps Achieve Parallelization
- Passing Functions to Spark
6. DataFrames and Spark SQL
- Need for Spark SQL
- What is Spark SQL
- Spark SQL Architecture
- SQL Context in Spark SQL
- Schema RDDs
- User Defined Functions
- Data Frames & Datasets
- Interoperating with RDDs
- JSON and Parquet File Formats
- Loading Data through Different Sources
- Spark-Hive Integration
7. Machine Learning using Spark MLlib
- Why Machine Learning?
- What is Machine Learning?
- Where Machine Learning Is Used
- Use Case: Face Detection
- Different Types of Machine Learning Techniques
- Introduction to MLlib
- Features of MLlib and MLlib Tools
- Various ML Algorithms Supported by MLlib
8. Deep Dive into Spark MLlib
- Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
- Unsupervised Learning - K-Means Clustering & How It Works with MLlib
- Analysis on US Election Data using MLlib (K-Means)
9. Understanding Apache Kafka and Apache Flume
- Need for Kafka
- What is Kafka
- Core Concepts of Kafka
- Kafka Architecture
- Where is Kafka Used
- Understanding the Components of Kafka Cluster
- Configuring Kafka Cluster
- Kafka Producer and Consumer Java API
- Need for Apache Flume
- What is Apache Flume
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Channels
- Flume Configuration
- Integrating Apache Flume and Apache Kafka
10. Apache Spark Streaming - Processing Multiple Batches
- Drawbacks in Existing Computing Methods
- Why Streaming is Necessary
- What is Spark Streaming
- Spark Streaming Features
- Spark Streaming Workflow
- How Uber Uses Streaming Data
- Streaming Context & DStreams
- Transformations on DStreams
- Windowed Operators and Why They Are Useful
- Important Windowed Operators
- Slice, Window and ReduceByWindow Operators
- Stateful Operators
11. Apache Spark Streaming - Data Sources
- Apache Spark Streaming: Data Sources
- Streaming Data Source Overview
- Apache Flume and Apache Kafka Data Sources
- Example: Using a Kafka Direct Data Source