Description
In this course, you will:
- Use functional-style Java to define complex data processing jobs
- Learn the differences between the RDD and DataFrame APIs
- Use an SQL-style syntax to produce reports against Big Data sets
- Use Machine Learning algorithms with Big Data and SparkML
- Connect Spark to Apache Kafka to process streams of Big Data
- See how Structured Streaming can be used to build pipelines with Kafka
Syllabus:
1. Getting Started
- Warning - Java 9+ is not supported by Spark 2. You can optionally use Spark 3.
- Installing Spark
2. Reduces on RDDs
- Reduces on RDDs
3. Mapping and Outputting
- Mapping Operations
- Outputting Results to the Console
- Counting Big Data Items
- If you've had a "NotSerializableException" in Spark
4. Tuples
- RDDs of Objects
- Tuples and RDDs
5. PairRDDs
- Overview of PairRDDs
- Building a PairRDD
- Coding a ReduceByKey
- Using the Fluent API
- Grouping By Key
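A minimal sketch of the PairRDD and reduceByKey pattern this chapter covers, assuming a local master and a made-up list of log levels as the input data:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class PairRddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pairRdd").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input data - in the course this comes from log files.
            List<String> logLevels = Arrays.asList("WARN", "ERROR", "WARN", "INFO", "WARN");

            // Build a PairRDD of (level, 1) and reduce by key using the fluent API.
            JavaPairRDD<String, Long> counts = sc.parallelize(logLevels)
                    .mapToPair(level -> new Tuple2<>(level, 1L))
                    .reduceByKey(Long::sum);

            counts.collect().forEach(t -> System.out.println(t._1 + " appears " + t._2 + " times"));
        }
    }
}
```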
6. FlatMaps and Filters
- FlatMaps
- Filters
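As a rough illustration of the flatMap and filter operations named above, reusing the JavaSparkContext sc from the earlier sketch (the input lines are invented):

```java
import org.apache.spark.api.java.JavaRDD;
import java.util.Arrays;

JavaRDD<String> sentences = sc.parallelize(Arrays.asList(
        "WARN: Tuesday 4 September 0405",
        "ERROR: Tuesday 4 September 0408"));

// flatMap: one input line in, many words out.
JavaRDD<String> words = sentences.flatMap(sentence -> Arrays.asList(sentence.split(" ")).iterator());

// filter: keep only words longer than one character.
JavaRDD<String> longWords = words.filter(word -> word.length() > 1);
```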
7. Reading from Disk
- Reading from Disk
8. Keyword Ranking Practical
- Practical Requirements
- Worked Solution
- Worked Solution (continued) with Sorting
9. Sorts and Coalesce
- Why do sorts not work with foreach in Spark?
- Why Coalesce is the Wrong Solution
- What is Coalesce used for in Spark?
10. Deploying to AWS EMR
- How to start an EMR Spark Cluster
- Packaging a Spark Jar for EMR
- Running a Spark Job on EMR
- Understanding the Job Progress Output
- Calculating EMR costs and Terminating the cluster
11. Joins
- Inner Joins
- Left Outer Joins and Optionals
- Right Outer Joins
- Full Joins and Cartesians
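A sketch of a left outer join on PairRDDs, where unmatched right-hand values come back as Spark's own Optional; the user ids and names are made up, and sc is the JavaSparkContext from the earlier sketch:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.Optional;
import scala.Tuple2;
import java.util.Arrays;

JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(4, 18), new Tuple2<>(6, 4), new Tuple2<>(10, 9)));

JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "John"), new Tuple2<>(4, "Doris"), new Tuple2<>(6, "Raquel")));

// Left outer join: every visit is kept; users with no match appear as Optional.empty().
JavaPairRDD<Integer, Tuple2<Integer, Optional<String>>> joined = visits.leftOuterJoin(users);
joined.collect().forEach(t ->
        System.out.println(t._1 + " -> " + t._2._2.orElse("unknown user")));
```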
12. Big Data Big Exercise
- Introducing the Requirements
- Warmup
- Main Exercise Requirements
- Walkthrough
- Adding Titles and Using the Big Data File
13. RDD Performance
- Transformations and Actions
- The DAG and SparkUI
- Narrow vs Wide Transformations
- Shuffles
- Dealing with Key Skews
- Avoiding groupByKey and using map-side-reduces instead
- Caching and Persistence
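One way to picture the map-side-reduce and caching ideas from this chapter, reusing the words RDD from the flatMap sketch above:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

JavaPairRDD<String, Long> pairs = words.mapToPair(w -> new Tuple2<>(w, 1L));

// Prefer reduceByKey over groupByKey().mapValues(...): each partition
// pre-aggregates before the shuffle, so far less data moves across the network.
JavaPairRDD<String, Long> counts = pairs.reduceByKey(Long::sum);

// Persist when the same RDD feeds more than one action, so it isn't recomputed.
counts.persist(StorageLevel.MEMORY_AND_DISK());
System.out.println("distinct words: " + counts.count());
counts.take(10).forEach(System.out::println);
```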
14. SparkSQL Introduction
- Code for SQL/DataFrames Section
- Introducing SparkSQL
15. SparkSQL Getting Started
- SparkSQL Getting Started
16. Datasets
- Dataset Basics
- Filters using Expressions
- Filters using Lambdas
- Filters using Columns
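The three filter styles listed above can be sketched like this, assuming a Dataset<Row> named dataset with a "level" column (the dataset and column names are illustrative):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Filter using an SQL-style expression.
Dataset<Row> warningsByExpression = dataset.filter("level = 'WARN'");

// Filter using a Java lambda.
Dataset<Row> warningsByLambda = dataset.filter(
        (FilterFunction<Row>) row -> "WARN".equals(row.getAs("level")));

// Filter using the column API.
Dataset<Row> warningsByColumn = dataset.filter(col("level").equalTo("WARN"));
```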
17. The Full SQL Syntax
- Using a Spark Temporary View for SQL
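A minimal sketch of registering a temporary view and querying it with the full SQL syntax, assuming the same SparkSession spark and Dataset<Row> dataset as above (table and column names are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

dataset.createOrReplaceTempView("logging_table");

Dataset<Row> results = spark.sql(
        "select level, count(1) as total from logging_table group by level order by total desc");
results.show();
```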
18. In Memory Data
- In Memory Data
19. Groupings and Aggregations
- Groupings and Aggregations
20. Date Formatting
- Date Formatting
21. Multiple Groupings
- Multiple Groupings
22. Ordering
- Ordering
23. DataFrames API
- SQL vs DataFrames
- DataFrame Grouping
24. Pivot Tables
- How does a Pivot Table work?
- Coding a Pivot Table in Spark
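A rough sketch of a pivot table in the DataFrame API, assuming a dataset with "subject" and "year" columns (the names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One row per subject; the distinct years become the pivoted columns.
Dataset<Row> pivot = dataset.groupBy("subject").pivot("year").count();
pivot.show();
```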
25. More Aggregations
- How to use the agg method in Spark
26. User Defined Functions
- How to use a Lambda to write a UDF in Spark
- Using more than one input parameter in Spark UDF
- Using a UDF in Spark SQL
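A sketch of registering a lambda-based UDF with two input parameters and using it from both the DataFrame API and Spark SQL; the function name, columns, and pass/fail rule are all invented for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Two input parameters: a grade and a subject. Returns whether the student passed.
spark.udf().register("hasPassed",
        (UDF2<String, String, Boolean>) (grade, subject) ->
                "Biology".equals(subject) ? grade.startsWith("A")
                                          : grade.startsWith("A") || grade.startsWith("B"),
        DataTypes.BooleanType);

// From the DataFrame API...
Dataset<Row> withPass = dataset.withColumn("pass", callUDF("hasPassed", col("grade"), col("subject")));

// ...or from Spark SQL after registering a temp view.
dataset.createOrReplaceTempView("students");
spark.sql("select grade, subject, hasPassed(grade, subject) as pass from students").show();
```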
27. SparkSQL Performance
- Understand the SparkUI for SparkSQL
- How does SQL and DataFrame performance compare?
- Update - Setting spark.sql.shuffle.partitions
28. HashAggregation
- Explaining Execution Plans
- How does HashAggregation work?
- How can I force Spark to use HashAggregation?
- SQL vs DataFrames Performance Results
29. SparkSQL Performance vs RDDs
- SparkSQL Performance vs RDDs
30. SparkML for Machine Learning
- Welcome to Module 3
- What is Machine Learning?
- Coming up in this Module - and introducing Kaggle
- Supervised vs Unsupervised Learning
- The Model Building Process
31. Linear Regression Models
- Introducing Linear Regression
- Beginning Coding Linear Regressions
- Assembling a Vector of Features
- Model Fitting
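A rough sketch of assembling a feature vector and fitting a linear regression with SparkML, assuming a Dataset<Row> named housing with numeric columns (the column names are made up):

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Combine the input columns into the single "features" vector SparkML expects.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"sqft_living", "bedrooms", "bathrooms"})
        .setOutputCol("features");

Dataset<Row> modelInput = assembler.transform(housing)
        .select("price", "features")
        .withColumnRenamed("price", "label");

// Fit the model: label = intercept + coefficients . features
LinearRegressionModel model = new LinearRegression().fit(modelInput);
System.out.println("intercept: " + model.intercept());
System.out.println("coefficients: " + model.coefficients());
```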
32. Training Data
- Training vs Test and Holdout Data
- Using data from Kaggle
- Practical Walkthrough
- Splitting Training Data with Random Splits
- Assessing Model Accuracy with R2 and RMSE
33. Model Fitting Parameters
- Setting Linear Regression Parameters
- Training, Test and Holdout Data
34. Feature Selection
- Describing the Features
- Correlation of Features
- Identifying and Eliminating Duplicated Features
- Data Preparation
35. Non-Numeric Data
- Using OneHotEncoding
- Understanding Vectors
36. Pipelines
- Pipelines
37. Logistic Regression
- Code for chapters
- True/False Negatives and Positives
- Coding a Logistic Regression
38. Decision Trees
- Overview of Decision Trees
- Building the Model
- Interpreting a Decision Tree
- Random Forests
39. K Means Clustering
- K Means Clustering
40. Recommender Systems
- Overview and Matrix Factorisation
- Building the Model
41. Spark Streaming and Structured Streaming with Kafka
- Spark Streaming
- Introduction to Streaming
- DStreams
- Starting a Streaming Job
- Streaming Transformations
- Streaming Aggregations
- SparkUI for Streaming Jobs
- Windowing Batches
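A minimal DStream sketch along the lines of this chapter, reading lines from a local socket; the host, port, and one-second batch duration are just example values:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming").setMaster("local[*]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each batch is a small RDD of whatever arrived on the socket in the last second.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 8989);
        lines.map(String::toUpperCase).print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```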
42. Streaming with Apache Kafka
- Overview of Kafka
- Installing Kafka
- Using a Kafka Event Simulator
- Integrating Kafka with Spark
- Using KafkaUtils to access a DStream
- Writing a Kafka Aggregation
- Adding a Window
- Adding a Slide Interval
43. Structured Streaming
- Structured Streaming Overview
- Data Sinks
- Structured Streaming Output Modes
- Windows and Watermarks
- What is the Batch Size in Structured Streaming?
- Kafka Structured Streaming Pipelines
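And a hedged sketch of the Kafka structured-streaming pipeline idea from the final chapter; the broker address, topic name, and output mode are illustrative, and the spark-sql-kafka connector is assumed to be on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("structuredViewingReport")
                .getOrCreate();

        // Read the Kafka topic as an unbounded table; key and value arrive as binary columns.
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "viewrecords")
                .load();

        df.createOrReplaceTempView("viewing_figures");
        Dataset<Row> results = spark.sql(
                "select cast(value as string) as course_name, count(1) as views " +
                "from viewing_figures group by cast(value as string)");

        // The console sink is a convenient data sink while developing.
        StreamingQuery query = results.writeStream()
                .format("console")
                .outputMode(OutputMode.Update())
                .start();

        query.awaitTermination();
    }
}
```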