Description
Learn how to create data models, data warehouses, and data lakes, as well as how to automate data pipelines and work with large datasets. You will combine your new skills by completing a capstone project at the end of the program.
Syllabus:
Course 1: Data Modeling
Introduction to Data Modeling
- Understand the purpose of data modeling
- Identify the strengths and weaknesses of different types of databases and data storage techniques
- Create a table in Postgres and Apache Cassandra
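For a concrete taste of the Postgres half of that, here is a minimal sketch that creates a table with psycopg2; the connection string, database, and table are hypothetical placeholders, and an Apache Cassandra counterpart appears after the NoSQL Data Models lesson below.

```python
# Minimal sketch: create a table in a local PostgreSQL database with psycopg2.
# The connection details and the table itself are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id   TEXT PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id TEXT,
        year      INT,
        duration  FLOAT
    );
""")

cur.close()
conn.close()
```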
Relational Data Models
- Understand when to use a relational database
- Understand the difference between OLAP and OLTP databases
- Create normalized data tables
- Implement denormalized schemas (e.g., star, snowflake)
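As an illustration of a denormalized star schema, the sketch below creates one dimension table and one fact table that references it; the table and column names are hypothetical, not the project's actual schema.

```python
# Sketch of a star schema: a fact table referencing a surrounding dimension table.
# All names are illustrative; assumes the same local PostgreSQL setup as above.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_users (
        user_id    INT PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        level      TEXT
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT REFERENCES dim_users (user_id),
        song_id     TEXT,
        session_id  INT
    );
""")

conn.commit()
cur.close()
conn.close()
```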
NoSQL Data Models
- Understand when to use NoSQL databases and how they differ from relational databases
- Select the appropriate primary key and clustering columns for a given use case
- Create a NoSQL database in Apache Cassandra
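Because Cassandra tables are designed around the queries they serve, choosing the partition key and clustering columns is the central modeling decision. The sketch below uses the DataStax Python driver against a single local node; the keyspace, table, and replication settings are assumptions, not the project's actual design.

```python
# Sketch: create a keyspace and a query-oriented table in Apache Cassandra.
# Assumes a single local node; all names and settings are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
session.set_keyspace("sparkify")

# Partition key: session_id (controls how rows are distributed across nodes);
# clustering column: item_in_session (sort order within each partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song_title      TEXT,
        PRIMARY KEY (session_id, item_in_session)
    );
""")

cluster.shutdown()
```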
Project 1: Data Modeling with Postgres
In this project, you will model user activity data for Sparkify, a music streaming app. You'll build a relational database and an ETL pipeline to optimize queries for determining which songs users are listening to. You will also define Fact and Dimension tables in PostgreSQL and insert data into your new tables.
Project 2: Data Modeling with Apache Cassandra
In this project, you will model user activity data for Sparkify, a music streaming app. You'll build a NoSQL database and an ETL pipeline in Apache Cassandra to optimize queries for determining which songs users are listening to. You will design your tables around those queries and insert data into your new Apache Cassandra tables.
Course 2: Cloud Data Warehouses
Introduction to Data Warehouses
- Understand Data Warehousing architecture
- Run an ETL process to denormalize a database (3NF to star schema)
- Create an OLAP cube from facts and dimensions (see the sketch after this list)
- Compare columnar vs. row oriented approaches
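One way to approximate an OLAP cube directly in SQL is PostgreSQL's GROUP BY CUBE, which aggregates across every combination of the listed dimensions. The sketch below is illustrative only and reuses the hypothetical fact and dimension tables from the star schema sketch above.

```python
# Sketch: an OLAP-cube style aggregation with GROUP BY CUBE in PostgreSQL.
# Table and column names are the hypothetical ones from the earlier star schema sketch.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# CUBE produces subtotals for every combination of month and subscription level.
cur.execute("""
    SELECT date_trunc('month', s.start_time) AS month,
           u.level,
           COUNT(*) AS plays
    FROM fact_songplays s
    JOIN dim_users u ON u.user_id = s.user_id
    GROUP BY CUBE (date_trunc('month', s.start_time), u.level);
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```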
Introduction to the Cloud with AWS
- Understand cloud computing
- Create an AWS account and understand its services
- Set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL
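As a small example of scripting AWS from Python, the boto3 sketch below creates an S3 bucket and lists your buckets; it assumes IAM credentials are already configured locally (for example with aws configure), and the bucket name and region are placeholders.

```python
# Sketch: create and list an S3 bucket with boto3.
# Assumes IAM credentials are configured locally; bucket name and region are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="example-sparkify-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

response = s3.list_buckets()
print([bucket["Name"] for bucket in response["Buckets"]])
```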
Implementing Data Warehouses on AWS
- Identify components of the Redshift architecture
- Run an ETL process to extract data from S3 into Redshift
- Set up AWS infrastructure using Infrastructure as Code (IaC)
- Design an optimized table by selecting the appropriate distribution style and sorting key
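The sketch below puts those last two ideas together: a Redshift table declared with a distribution key and sort key, then loaded from S3 with the COPY command. The cluster endpoint, credentials, IAM role ARN, and bucket path are all placeholders.

```python
# Sketch: a Redshift table with DISTKEY/SORTKEY, loaded from S3 with COPY.
# Endpoint, credentials, IAM role, and bucket path are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=example-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "dbname=dev user=awsuser password=example port=5439"
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1),
        start_time  TIMESTAMP SORTKEY,   -- rows stored sorted by time
        user_id     INT,
        song_id     VARCHAR DISTKEY,     -- rows distributed across slices by song
        session_id  INT
    );
""")

cur.execute("""
    COPY songplays
    FROM 's3://example-bucket/songplays/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/exampleRedshiftRole'
    FORMAT AS JSON 'auto';
""")

conn.commit()
cur.close()
conn.close()
```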
Project: Build a Cloud Data Warehouse
In this project, you are tasked with creating an ELT pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so that the analytics team can continue to discover insights into which songs users are listening to.
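In an ELT flow like this, the raw data is first copied into staging tables and then transformed inside Redshift with SQL. The sketch below shows one such transform step as an INSERT ... SELECT from a hypothetical staging table into a dimension table; all names are illustrative.

```python
# Sketch: the "transform" step of an ELT pipeline, executed inside Redshift.
# Assumes a staging_events table was already loaded with COPY; names are illustrative.
import psycopg2

conn = psycopg2.connect(
    "host=example-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "dbname=dev user=awsuser password=example port=5439"
)
cur = conn.cursor()

cur.execute("""
    INSERT INTO dim_users (user_id, first_name, last_name, level)
    SELECT DISTINCT userid, firstname, lastname, level
    FROM staging_events
    WHERE userid IS NOT NULL;
""")

conn.commit()
cur.close()
conn.close()
```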
Course 3: Spark and Data Lakes
The Power of Spark
- Understand the big data ecosystem
- Understand when to use Spark and when not to use it
Data Wrangling with Spark
- Manipulate data with SparkSQL and Spark DataFrames
- Use Spark for ETL purposes
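Here is a brief PySpark sketch of that kind of wrangling, expressing one aggregation with the DataFrame API and again with SparkSQL over a temporary view; the input path and column names are placeholders.

```python
# Sketch: data wrangling with Spark DataFrames and SparkSQL.
# The input path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()

logs = spark.read.json("data/log_data/*.json")

# DataFrame API: filter, group, and count.
plays_per_user = (
    logs.filter(col("page") == "NextSong")
        .groupBy("userId")
        .count()
)

# SparkSQL: the same question expressed as SQL over a temporary view.
logs.createOrReplaceTempView("log_events")
plays_sql = spark.sql("""
    SELECT userId, COUNT(*) AS plays
    FROM log_events
    WHERE page = 'NextSong'
    GROUP BY userId
""")

plays_per_user.show(5)
plays_sql.show(5)

spark.stop()
```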
Debugging and Optimization
- Troubleshoot common errors and optimize your code using the Spark Web UI
Introduction to Data Lakes
- Understand the purpose and evolution of data lakes
- Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
- Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages
- Understand the components and issues of data lakes
Project: Build a Data Lake
You will create an ETL pipeline for a data lake in this project. The data is stored in S3, in a directory of JSON logs on app user activity and a directory of JSON metadata on the songs in the app.
You will load data from S3, process it into analytics tables with Spark, and then load it back into S3. You will use AWS to deploy this Spark process on a cluster.
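A condensed sketch of such a pipeline is shown below: read JSON logs from S3 with Spark, derive an analytics table, and write it back to S3 as partitioned Parquet. The bucket paths and column names are placeholders, and the cluster is assumed to be configured for S3 access.

```python
# Sketch: data lake ETL with Spark, reading JSON from S3, transforming it,
# and writing an analytics table back to S3 as partitioned Parquet.
# Bucket paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

logs = spark.read.json("s3a://example-bucket/log_data/*.json")

songplays = (
    logs.filter(col("page") == "NextSong")
        .withColumn("start_time", (col("ts") / 1000).cast("timestamp"))
        .select("start_time", "userId", "song", "artist", "sessionId")
        .withColumn("year", year("start_time"))
        .withColumn("month", month("start_time"))
)

(songplays.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://example-bucket/analytics/songplays/"))

spark.stop()
```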
Course 4: Automate Data Pipelines
Data Pipelines
- Create data pipelines with Apache Airflow
- Set up task dependencies
- Create data connections using hooks
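The sketch below pulls those pieces together in a small DAG: two tasks, an explicit dependency, and a PostgresHook for the database connection. It assumes Airflow 2.x and a connection named example_postgres defined in Airflow; the table it queries is a placeholder.

```python
# Sketch: a minimal Airflow DAG with a task dependency and a Postgres hook.
# Assumes Airflow 2.x and an "example_postgres" connection; the table is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract():
    print("pretend to pull data from an API or S3 here")


def count_rows():
    hook = PostgresHook(postgres_conn_id="example_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM songs")
    print(f"songs table currently holds {records[0][0]} rows")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    count_task = PythonOperator(task_id="count_rows", python_callable=count_rows)

    extract_task >> count_task  # count_rows runs only after extract succeeds
```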
Data Quality
- Track data lineage
- Set up data pipeline schedules
- Partition data to optimize pipelines
- Write tests to ensure data quality
- Backfill data
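For example, the sketch below adds a simple row-count quality check and uses a past start_date with catchup=True so Airflow can backfill the missed daily runs; it assumes Airflow 2.x and the same hypothetical example_postgres connection as above.

```python
# Sketch: a data quality check task on a schedule that supports backfills.
# Assumes Airflow 2.x and an "example_postgres" connection; the table is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_has_rows():
    hook = PostgresHook(postgres_conn_id="example_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM songs")
    if not records or records[0][0] < 1:
        raise ValueError("Data quality check failed: songs table is empty")


with DAG(
    dag_id="quality_checked_pipeline",
    start_date=datetime(2024, 1, 1),  # a start date in the past ...
    schedule_interval="@daily",
    catchup=True,                     # ... plus catchup=True lets Airflow backfill old runs
) as dag:
    PythonOperator(task_id="check_has_rows", python_callable=check_has_rows)
```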
Production Data Pipelines
- Build reusable and maintainable pipelines
- Build your own Apache Airflow plugins (see the sketch after this list)
- Implement subDAGs
- Set up task boundaries
- Monitor data pipelines
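The sketch below shows the row-count check from the previous example packaged as a reusable custom operator, the kind of component you might ship as part of an Airflow plugin; it assumes Airflow 2.x, and the connection id and table are placeholders.

```python
# Sketch: a reusable custom operator for data quality checks.
# Assumes Airflow 2.x; connection id and table name are placeholders.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class HasRowsOperator(BaseOperator):
    """Fail the task if the given table has no rows."""

    def __init__(self, table, postgres_conn_id="example_postgres", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info("Table %s passed with %s rows", self.table, records[0][0])
```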
Project 1: Data Pipelines with Airflow
In this project, you will continue your work on the data infrastructure of a music streaming company by creating and automating a set of data pipelines. You'll use Airflow to configure and schedule data pipelines, as well as monitor and debug production pipelines.
Project 2: Data Engineering Capstone
The goal of the data engineering capstone project is to let you apply what you've learned throughout the program.
This project will be an important part of your portfolio, helping you achieve your data engineering career goals.
In this project, you will define the project's scope and the data you will work with. We will provide guidelines, suggestions, tips, and resources to help you succeed, but your project will be unique to you. You will gather data from various sources; transform, combine, and summarize it; and create a clean database for others to analyze.