Description
In this course, you will learn:
- Storing Big Data in HDFS
 - Transformations and Actions in Spark
 - Data Ingestion using Sqoop and Flume
 - Querying Big Data using Spark SQL
 - Building Data Pipeline using Kafka
 - Real-time Data Processing with Spark
 
Syllabus:
1. Introduction to Big Data Hadoop and Spark
- What is Big Data?
 - Big Data Customer Scenarios
 - Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
 - How Hadoop Solves the Big Data Problem
 - What is Hadoop?
 - Hadoop’s Key Characteristics
 - Hadoop Ecosystem and HDFS
 - Hadoop Core Components
 - Rack Awareness and Block Replication
 - YARN and Its Advantages
 - Hadoop Cluster and its Architecture
 - Hadoop: Different Cluster Modes
 - Big Data Analytics with Batch & Real-Time Processing
 - Why Is Spark Needed?
 - What is Spark?
 - How Spark Differs from Its Competitors
 - Spark at eBay
 - Spark’s Place in Hadoop Ecosystem
 
2. Introduction to Python for Apache Spark
- Overview of Python
 - Different Applications where Python is Used
 - Values, Types, Variables
 - Operands and Expressions
 - Conditional Statements
 - Loops
 - Command Line Arguments
 - Writing to the Screen
 - Python File I/O Functions
 - Numbers
 - Strings and related operations
 - Tuples and related operations
 - Lists and related operations
 - Dictionaries and related operations
 - Sets and related operations
 
3. Functions, OOP, and Modules in Python
- Functions
 - Function Parameters
 - Global Variables
 - Variable Scope and Returning Values
 - Lambda Functions
 - Object-Oriented Concepts
 - Standard Libraries
 - Modules Used in Python
 - The Import Statement
 - Module Search Path
 - Ways to Install Packages
 
4. Deep Dive into Apache Spark Framework
- Spark Components and Architecture
 - Spark Deployment Modes
 - Introduction to PySpark Shell
 - Submitting PySpark Job
 - Spark Web UI
 - Writing your first PySpark Job Using Jupyter Notebook
 - Data Ingestion using Sqoop
 
5. Playing with Spark RDDs
- Challenges in Existing Computing Methods
 - Probable Solution & How RDD Solves the Problem
 - What Is an RDD: Operations, Transformations & Actions
 - Data Loading and Saving Through RDDs
 - Key-Value Pair RDDs
 - Other Pair RDDs, Two Pair RDDs
 - RDD Lineage
 - RDD Persistence
 - WordCount Program Using RDD Concepts
 - RDD Partitioning & How it Helps Achieve Parallelization
 - Passing Functions to Spark
 
6. DataFrames and Spark SQL
- Need for Spark SQL
 - What is Spark SQL
 - Spark SQL Architecture
 - SQL Context in Spark SQL
 - Schema RDDs
 - User Defined Functions
 - DataFrames & Datasets
 - Interoperating with RDDs
 - JSON and Parquet File Formats
 - Loading Data through Different Sources
 - Spark-Hive Integration
 
7. Machine Learning using Spark MLlib
- Why Machine Learning?
 - What is Machine Learning?
 - Where Is Machine Learning Used?
 - Use Case: Face Detection
 - Different Types of Machine Learning Techniques
 - Introduction to MLlib
 - Features of MLlib and MLlib Tools
 - Various ML Algorithms Supported by MLlib
 
8. Deep Dive into Spark MLlib
- Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
 - Unsupervised Learning - K-Means Clustering & How It Works with MLlib
 - Analysis of US Election Data using MLlib (K-Means)
 
9. Understanding Apache Kafka and Apache Flume
- Need for Kafka
 - What is Kafka
 - Core Concepts of Kafka
 - Kafka Architecture
 - Where is Kafka Used
 - Understanding the Components of Kafka Cluster
 - Configuring Kafka Cluster
 - Kafka Producer and Consumer Java API
 - Need for Apache Flume
 - What is Apache Flume
 - Basic Flume Architecture
 - Flume Sources
 - Flume Sinks
 - Flume Channels
 - Flume Configuration
 - Integrating Apache Flume and Apache Kafka
 
10. Apache Spark Streaming - Processing Multiple Batches
- Drawbacks in Existing Computing Methods
 - Why Streaming is Necessary
 - What is Spark Streaming
 - Spark Streaming Features
 - Spark Streaming Workflow
 - How Uber Uses Streaming Data
 - Streaming Context & DStreams
 - Transformations on DStreams
 - Windowed Operators and Why They Are Useful
 - Important Windowed Operators
 - Slice, Window and ReduceByWindow Operators
 - Stateful Operators
 
11. Apache Spark Streaming - Data Sources
- Apache Spark Streaming: Data Sources
 - Streaming Data Source Overview
 - Apache Flume and Apache Kafka Data Sources
 - Example: Using a Kafka Direct Data Source
 
