Description
In this course, you will learn:
- An overview of Apache Spark's architecture.
- How to process and analyze large datasets using Apache Spark's primary abstraction, resilient distributed datasets (RDDs).
- How to develop Apache Spark 2.0 applications using RDD transformations and actions, as well as Spark SQL (see the first sketch below).
- How to scale Spark applications on a Hadoop YARN cluster using Amazon's Elastic MapReduce service.
- How to analyze structured and semi-structured data with Datasets and DataFrames, and gain a thorough grounding in Spark SQL (see the second sketch below).
- How to share data across the nodes of an Apache Spark cluster using broadcast variables and accumulators.
- Advanced techniques for optimizing and tuning Apache Spark jobs by partitioning, caching, and persisting RDDs.
- Best practices from the field for working with Apache Spark.
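
To make the RDD-centric outcomes above concrete, here is a minimal sketch in Scala, assuming a local Spark 2.x installation. The object name `RddSketch`, the toy input lines, and the stop-word set are hypothetical, chosen only to show transformations, actions, caching, a broadcast variable, and an accumulator working together.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // A small in-memory dataset stands in for a real input file.
    val lines = sc.parallelize(Seq(
      "spark makes big data simple",
      "rdds are spark's core abstraction"))

    // Transformations are lazy: nothing runs until an action is called.
    val words = lines.flatMap(_.split("\\s+"))

    // A broadcast variable ships a read-only lookup to every executor once.
    val stopWords = sc.broadcast(Set("are", "makes"))

    // An accumulator aggregates counts from all tasks back to the driver.
    // Note: accumulator updates made inside transformations may be
    // re-counted if a task is retried; only actions guarantee exactly-once.
    val dropped = sc.longAccumulator("dropped words")

    val counts = words
      .filter { w =>
        val keep = !stopWords.value.contains(w)
        if (!keep) dropped.add(1)
        keep
      }
      .map((_, 1))
      .reduceByKey(_ + _)
      .cache() // persisted because two actions run on it below

    // Actions trigger execution.
    counts.collect().foreach(println)
    println(s"distinct words: ${counts.count()}, dropped: ${dropped.value}")

    spark.stop()
  }
}
```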
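
Similarly, the Spark SQL and DataFrame outcome can be illustrated with a second short sketch, again assuming a local Spark 2.x setup. It runs the same query twice, once as SQL over a temporary view and once through the DataFrame API; the `people` data and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF on local collections and $"col"

    // A tiny hand-built DataFrame stands in for structured input
    // such as JSON, Parquet, or CSV.
    val people = Seq(("Ada", 36), ("Linus", 29), ("Grace", 45))
      .toDF("name", "age")

    // Register the DataFrame so it can be queried with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    // The same query expressed through the DataFrame API.
    people.filter($"age" > 30).select("name").show()

    spark.stop()
  }
}
```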
Syllabus:
- Get Started with Apache Spark
- RDD
- Spark Architecture and Components
- Pair RDD
- Advanced Spark Topics
- Spark SQL
- Running Spark in a Cluster
- Additional Learning Materials