Description
Learn how to process data in real time using modern data engineering tools like Apache Spark, Kafka, Spark Streaming, and Kafka Streams. You'll begin by learning about the components of data streaming systems. After that, you'll create a real-time analytics application. You'll also collect data, run analytics, and derive insights from reports generated by the streaming console.
Syllabus:
Course 1: Foundations of Data Streaming, and SQL & Data Modeling for the Web
Introduction to Stream Processing
- Describe and explain streaming data stores and stream processing
- Describe and explain real-world usages of stream processing
- Describe and explain append-only logs, events, and how stream processing differs from batch processing
- Utilize Kafka CLI tools and the Confluent Kafka Python library for topic management, production, and consumption
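As a minimal sketch of the last objective above, producing to and consuming from a topic with the Confluent Kafka Python library looks roughly like this (the broker address, topic name, and consumer group are placeholders):

```python
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"  # placeholder broker address

# Produce a few events to an append-only log (a Kafka topic)
producer = Producer({"bootstrap.servers": BROKER})
for i in range(3):
    producer.produce("com.example.purchases", value=f"purchase #{i}".encode("utf-8"))
producer.flush()  # block until all buffered messages are delivered

# Consume the same events back
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "example-consumer-group",
    "auto.offset.reset": "earliest",  # start from the beginning of the log
})
consumer.subscribe(["com.example.purchases"])
try:
    while True:  # loop until interrupted (Ctrl-C)
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"consumed: {msg.value().decode('utf-8')}")
finally:
    consumer.close()
```

The Kafka CLI tools (kafka-topics, kafka-console-producer, kafka-console-consumer) cover the same workflow from the shell.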
Apache Kafka
- Describe and explain Kafka architecture
- Describe and explain Kafka topics and configuration
- Utilize Confluent Kafka Python to create topics and configuration
- Describe and explain Kafka producers, consumers, and configuration
- Utilize Confluent Kafka Python to create producers and configuration
- Utilize Confluent Kafka Python to create consumers and configuration, and manage offsets
- Describe and explain user privacy considerations
- Describe and explain performance monitoring for consumers, producers, and the cluster itself
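A sketch of topic creation with explicit configuration and of manual offset management with Confluent Kafka Python follows; topic names, retention settings, and the single-broker replication factor are placeholders for illustration only.

```python
from confluent_kafka.admin import AdminClient, NewTopic
from confluent_kafka import Consumer

BROKER = "localhost:9092"  # placeholder broker address

# Create a topic with explicit partition, replication, and retention settings
admin = AdminClient({"bootstrap.servers": BROKER})
futures = admin.create_topics([
    NewTopic(
        "com.example.clicks",
        num_partitions=3,
        replication_factor=1,
        config={"cleanup.policy": "delete", "retention.ms": "3600000"},
    )
])
for topic, future in futures.items():
    try:
        future.result()  # raises if topic creation failed
        print(f"created topic {topic}")
    except Exception as e:
        print(f"failed to create {topic}: {e}")

# Consume with manual offset commits instead of auto-commit
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "clicks-consumer",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["com.example.clicks"])
msg = consumer.poll(5.0)
if msg is not None and not msg.error():
    print(msg.value())
    consumer.commit(message=msg, asynchronous=False)  # commit this offset explicitly
consumer.close()
```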
Data Schemas and Apache Avro
- Describe and explain what a data schema is and what value it provides
- Describe and explain what Apache Avro is and what value it provides
- Utilize AvroProducer and AvroConsumer in Confluent Kafka Python
- Describe and explain schema evolution and data compatibility types
- Utilize Schema Registry components in Confluent Kafka Python to manage compatibility
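A minimal sketch of Avro production with a registered schema, assuming a local broker and Schema Registry; the schema, topic, and record contents are illustrative:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# The value schema that the producer registers with the Schema Registry
value_schema = avro.loads("""
{
    "type": "record",
    "name": "purchase",
    "namespace": "com.example",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "amount", "type": "double"}
    ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",            # placeholder broker
        "schema.registry.url": "http://localhost:8081",   # placeholder registry
    },
    default_value_schema=value_schema,
)

# The dict is validated against the schema and serialized as Avro
producer.produce(topic="com.example.purchases", value={"username": "kim", "amount": 19.99})
producer.flush()
```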
Kafka Connect and REST Proxy
- Describe and explain what problem Kafka Connect solves and where it would be more appropriate than a traditional consumer
- Describe and explain common connectors and how they work
- Utilize Kafka Connect FileStream Source and Sink
- Utilize Kafka Connect JDBC Source and Sink
- Describe and explain what problem Kafka REST Proxy solves and where it would be more appropriate than alternatives
- Describe and explain the REST Proxy metadata and administrative APIs
- Utilize the REST Proxy administrative and metadata APIs
- Describe and explain the REST Proxy consumer APIs
- Utilize the REST Proxy consumer, subscription, and offset APIs
- Describe and explain the REST Proxy producer APIs
- Utilize the REST Proxy producer APIs
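For the REST Proxy objectives, a sketch of producing a JSON record and listing topics over HTTP is shown below, assuming a REST Proxy on its default port; the topic name and record fields are placeholders. Kafka Connect connectors (FileStream, JDBC) are configured similarly through Connect's own REST API.

```python
import json
import requests

REST_PROXY = "http://localhost:8082"  # default REST Proxy port; adjust for your cluster

# Produce a JSON-encoded record to a topic through the REST Proxy producer API
payload = {"records": [{"value": {"station": "Clark/Lake", "status": "on_time"}}]}
resp = requests.post(
    f"{REST_PROXY}/topics/com.example.train_status",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json())  # contains the partition and offset of each produced record

# The metadata API answers questions like "what topics exist on the cluster?"
topics = requests.get(f"{REST_PROXY}/topics").json()
print(topics)
```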
Stream Processing Fundamentals
- Describe and explain common scenarios for stream processing, and where you would use stream versus batch
- Describe and explain common stream processing strategies
- Describe and explain how time and windowing works in stream processing
- Describe and explain what a stream versus a table is in stream processing, and where you would use one over the other
- Describe and explain how data storage works in stream processing applications and why it is needed
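To make the time-and-windowing objective concrete, here is a tiny, framework-free sketch of tumbling windows: events are bucketed by their event time into fixed, non-overlapping intervals and aggregated per bucket (window size and event names are illustrative).

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # tumbling window size

def window_start(ts: datetime) -> datetime:
    # Align an event timestamp to the start of its 5-minute window
    return datetime.min + (ts - datetime.min) // WINDOW * WINDOW

events = [
    ("page_view", datetime(2024, 1, 1, 12, 2)),
    ("page_view", datetime(2024, 1, 1, 12, 4)),
    ("page_view", datetime(2024, 1, 1, 12, 7)),
]

# Count events per (event name, window start) pair
counts = defaultdict(int)
for name, ts in events:
    counts[(name, window_start(ts))] += 1

for (name, start), n in sorted(counts.items()):
    print(name, start.isoformat(), n)  # 12:00 window holds 2 events, 12:05 holds 1
```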
Stream Processing with Faust
- Describe and explain the Faust Stream Processing Python library, and how it fits into the ecosystem relative to solutions like Kafka Streams
- Describe and explain Faust stream-based processing
- Utilize Faust to create a stream-based application
- Describe and explain how Faust table-based processing works
- Utilize Faust to create a table-based application
- Describe and explain Faust processors and function usage
- Utilize Faust processors and functions
- Describe and explain Faust serialization and deserialization
- Utilize Faust serialization and deserialization
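The sketch below combines the stream- and table-based objectives in one small Faust application, assuming a local broker; the app name, topic, and record fields are placeholders. A worker is started from the command line, for example with faust -A purchase_aggregator worker.

```python
import faust


class Purchase(faust.Record):
    username: str
    amount: float


app = faust.App("purchase-aggregator", broker="kafka://localhost:9092")

# Input stream of purchase events, deserialized into Purchase records
purchases_topic = app.topic("com.example.purchases", value_type=Purchase)

# Changelog-backed table holding a running total per user
totals_table = app.Table("purchase-totals", default=float)


@app.agent(purchases_topic)
async def aggregate_purchases(purchases):
    # Repartition by username so each user's total lives on one worker, then aggregate
    async for purchase in purchases.group_by(Purchase.username):
        totals_table[purchase.username] += purchase.amount
        print(purchase.username, totals_table[purchase.username])


if __name__ == "__main__":
    app.main()
```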
KSQL
- Describe and explain how KSQL fits into the Kafka ecosystem, and why you would choose it over a stream processing application built from scratch
- Describe and explain KSQL architecture
- Describe and explain how to create KSQL streams and tables from topics. Understand the importance of KEY and schema transformations.
- Utilize KSQL to create tables and streams
- Describe and explain KSQL selection syntax
- Utilize KSQL syntax to query tables and streams
- Describe and explain KSQL windowing
- Utilize KSQL windowing within the context of table analysis
- Describe and explain KSQL grouping and aggregates
- Utilize KSQL grouping and aggregates within queries
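As a sketch of how these KSQL objectives come together, the statements below declare a stream over a topic and derive a windowed, aggregated table from it; they are submitted from Python to the KSQL server's REST endpoint (default port 8088). The topic, columns, and window size are illustrative.

```python
import json
import requests

KSQL_URL = "http://localhost:8088/ksql"  # default KSQL server REST endpoint

# Declare a stream over a topic, then derive a windowed, aggregated table from it
statements = """
CREATE STREAM turnstile (station_id INT, line VARCHAR)
    WITH (KAFKA_TOPIC='com.example.turnstile', VALUE_FORMAT='JSON');

CREATE TABLE turnstile_summary AS
    SELECT station_id, COUNT(*) AS entries
    FROM turnstile
    WINDOW TUMBLING (SIZE 1 HOUR)
    GROUP BY station_id;
"""

resp = requests.post(
    KSQL_URL,
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    data=json.dumps({
        "ksql": statements,
        "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
    }),
)
resp.raise_for_status()
print(resp.json())
```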
Project: Optimize Chicago Bus and Train Availability Using Kafka
In your first project, you will stream public transit status using Kafka and the Kafka ecosystem to build a stream processing application that displays train status in real time, allowing you to optimize the availability of buses and trains in Chicago based on streaming data. You will write your own Python code to generate events, use REST Proxy to send events over HTTP, and use Kafka Connect to collect data from a Postgres database, producing streaming data into Kafka from a variety of sources. Then, using KSQL, you will combine related data models into a single topic ready for consumption by downstream Python applications, and you will finish a simple Python application that ingests data from the Kafka topics for analysis. Finally, you will use the Faust Python stream processing library to further transform train station data into a more streamlined representation, using stateful processing to determine whether passenger volume is increasing, decreasing, or remaining constant.
Course 2: Streaming API Development and Documentation
Streaming DataFrames
- Start a Spark Cluster and Deploy a Spark Application
- Create a Spark Streaming DataFrame with a Kafka Source
- Create a Spark View
- Query a Spark View
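A minimal sketch of these objectives with PySpark Structured Streaming follows; the broker, topic, and app name are placeholders, and it assumes the spark-sql-kafka package is on the classpath (for example via spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-console").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# Streaming DataFrame with a Kafka source (topic and broker are placeholders)
raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "com.example.events")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka keys and values arrive as binary; cast them before querying
events_df = raw_df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Register a view and query it with Spark SQL
events_df.createOrReplaceTempView("Events")
query_df = spark.sql("SELECT value FROM Events")

# Sink the query results to the console and run until stopped
query_df.writeStream.outputMode("append").format("console").start().awaitTermination()
```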
Joins and JSON
- Parse a JSON Payload Into Separate Fields for Analysis
- Join Two Streaming DataFrames from Different Data Sources
- Write a Streaming DataFrame to Kafka with Aggregated Data
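A sketch of this module's workflow, under assumed schemas and field names: parse JSON payloads from two topics into columns, join the two streams on a shared customer identifier, and write the combined rows back to Kafka. Broker, topics, and the checkpoint path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.appName("join-json").getOrCreate()

# Illustrative schemas for two JSON-encoded topics
score_schema = StructType([
    StructField("customer", StringType()),
    StructField("score", FloatType()),
])
email_schema = StructType([
    StructField("customer", StringType()),
    StructField("email", StringType()),
])

def read_topic(topic):
    # Helper: streaming DataFrame over one Kafka topic, value cast to a string
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

# Parse each JSON payload into separate columns
scores_df = read_topic("com.example.scores") \
    .withColumn("value", from_json(col("value"), score_schema)) \
    .select(col("value.customer").alias("customer"), col("value.score").alias("score"))
emails_df = read_topic("com.example.emails") \
    .withColumn("value", from_json(col("value"), email_schema)) \
    .select(col("value.customer").alias("emailCustomer"), col("value.email").alias("email"))

# Join the two streaming DataFrames on the customer identifier
joined_df = scores_df.join(emails_df, scores_df.customer == emails_df.emailCustomer)

# Re-encode the joined rows as JSON and write them to an output topic
(
    joined_df.select(to_json(struct("customer", "score", "email")).alias("value"))
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "com.example.customer-scores")
    .option("checkpointLocation", "/tmp/kafka-checkpoint")  # placeholder path
    .start()
    .awaitTermination()
)
```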
Redis, Base64, and JSON
- Manually Save to Redis and Read the Same Data from a Kafka Topic
- Parse Base64 Encoded Information
- Sink a Subset of JSON Fields
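In this module, data written manually into Redis (for example from redis-cli) surfaces on a Kafka topic via the Kafka Connect Redis Source connector, with the interesting payload base64-encoded. The sketch below decodes such a payload and sinks a subset of its JSON fields; the outer and inner field names, topics, and checkpoint path are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, unbase64, col, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("redis-base64").getOrCreate()

# Assumed shape of the outer message: the interesting payload arrives base64-encoded
outer_schema = StructType([StructField("encodedPayload", StringType())])
inner_schema = StructType([
    StructField("customer", StringType()),
    StructField("riskScore", StringType()),
    StructField("birthDay", StringType()),
])

raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "com.example.redis-events")   # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
)

decoded_df = (
    raw_df
    # 1. Parse the outer JSON to reach the encoded field
    .withColumn("value", from_json(col("value"), outer_schema))
    # 2. Base64-decode the payload and cast the resulting bytes back to a string
    .withColumn("payload", unbase64(col("value.encodedPayload")).cast("string"))
    # 3. Parse the decoded JSON into separate columns
    .withColumn("payload", from_json(col("payload"), inner_schema))
    # 4. Keep only the fields we want to sink downstream
    .select(col("payload.customer").alias("customer"),
            col("payload.riskScore").alias("riskScore"))
)

# Sink the selected subset of fields to another Kafka topic as JSON
(
    decoded_df.select(to_json(struct("customer", "riskScore")).alias("value"))
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "com.example.risk-scores")
    .option("checkpointLocation", "/tmp/redis-checkpoint")  # placeholder path
    .start()
    .awaitTermination()
)
```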
Project: Evaluate Human Balance with Spark Streaming
In this project, you will work with a real-world application called the Step Trending Electronic Data Interface (STEDI), a functional application used to assess the risk of seniors falling. When a senior takes a test, they are scored using an index that reflects the likelihood of falling, and possibly injuring themselves, while walking. STEDI stores risk scores and other data in a Redis datastore. At a STEDI clinic, the Data Science team has completed a working graph for population risk; the issue is that it has not yet been populated with data. Using Kafka Connect Redis Source events and Business Events, you will create a Kafka topic containing anonymized risk scores of the seniors in the clinic.