Description
Learn how to create data models, data warehouses, and data lakes, as well as how to automate data pipelines and work with large datasets. You will combine your new skills by completing a capstone project at the end of the program.
Syllabus:
Course 1: Data Modeling
Introduction to Data Modeling
- Understand the purpose of data modeling
- Identify the strengths and weaknesses of different types of databases and data storage techniques
- Create a table in Postgres and Apache Cassandra
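For a concrete taste of the Postgres half of that, here is a minimal sketch that creates a table with psycopg2; the connection string, database, and table are hypothetical placeholders, and an Apache Cassandra counterpart appears after the NoSQL Data Models lesson below.

```python
# Minimal sketch: create a table in a local PostgreSQL database with psycopg2.
# The connection details and the table itself are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id   TEXT PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id TEXT,
        year      INT,
        duration  FLOAT
    );
""")

cur.close()
conn.close()
```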
Relational Data Models
- Understand when to use a relational database
- Understand the difference between OLAP and OLTP databases
- Create normalized data tables
- Implement denormalized schemas (e.g., star, snowflake)
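As an illustration of a denormalized star schema, the sketch below creates one dimension table and one fact table that references it; the table and column names are hypothetical, not the project's actual schema.

```python
# Sketch of a star schema: a fact table referencing a surrounding dimension table.
# All names are illustrative; assumes the same local PostgreSQL setup as above.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_users (
        user_id    INT PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        level      TEXT
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT REFERENCES dim_users (user_id),
        song_id     TEXT,
        session_id  INT
    );
""")

conn.commit()
cur.close()
conn.close()
```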
NoSQL Data Models
- Understand when to use NoSQL databases and how they differ from relational databases
- Select the appropriate primary key and clustering columns for a given use case
- Create a NoSQL database in Apache Cassandra
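Because Cassandra tables are designed around the queries they serve, choosing the partition key and clustering columns is the central modeling decision. The sketch below uses the DataStax Python driver against a single local node; the keyspace, table, and replication settings are assumptions, not the project's actual design.

```python
# Sketch: create a keyspace and a query-oriented table in Apache Cassandra.
# Assumes a single local node; all names and settings are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
session.set_keyspace("sparkify")

# Partition key: session_id (controls how rows are distributed across nodes);
# clustering column: item_in_session (sort order within each partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song_title      TEXT,
        PRIMARY KEY (session_id, item_in_session)
    );
""")

cluster.shutdown()
```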
Project 1: Data Modeling with Postgres
In this project, you will model user activity data for Sparkify, a music streaming app. You'll build a relational database and an ETL pipeline to optimize queries for determining which songs users are listening to. You will also define Fact and Dimension tables in PostgreSQL and insert data into your new tables.
Project 2: Data Modeling with Apache Cassandra
In this project, you will model user activity data for Sparkify, a music streaming app. You'll build a NoSQL database and an ETL pipeline in Apache Cassandra to optimize queries for determining which songs users are listening to. You will design your tables around those queries and insert data into your new Apache Cassandra tables.
Course 2: Cloud Data Warehouses
Introduction to Data Warehouses
- Understand Data Warehousing architecture
- Run an ETL process to denormalize a database (3NF to star schema)
- Create an OLAP cube from facts and dimensions (see the sketch after this list)
- Compare columnar vs. row oriented approaches
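One way to approximate an OLAP cube directly in SQL is PostgreSQL's GROUP BY CUBE, which aggregates across every combination of the listed dimensions. The sketch below is illustrative only and reuses the hypothetical fact and dimension tables from the star schema sketch above.

```python
# Sketch: an OLAP-cube style aggregation with GROUP BY CUBE in PostgreSQL.
# Table and column names are the hypothetical ones from the earlier star schema sketch.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# CUBE produces subtotals for every combination of month and subscription level.
cur.execute("""
    SELECT date_trunc('month', s.start_time) AS month,
           u.level,
           COUNT(*) AS plays
    FROM fact_songplays s
    JOIN dim_users u ON u.user_id = s.user_id
    GROUP BY CUBE (date_trunc('month', s.start_time), u.level);
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```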
Introduction to the Cloud with AWS
- Understand cloud computing
- Create an AWS account and understand its services
- Set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL
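As a small example of scripting AWS from Python, the boto3 sketch below creates an S3 bucket and lists your buckets; it assumes IAM credentials are already configured locally (for example with aws configure), and the bucket name and region are placeholders.

```python
# Sketch: create and list an S3 bucket with boto3.
# Assumes IAM credentials are configured locally; bucket name and region are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="example-sparkify-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

response = s3.list_buckets()
print([bucket["Name"] for bucket in response["Buckets"]])
```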
Implementing Data Warehouses on AWS
- Identify components of the Redshift architecture
- Run an ETL process to extract data from S3 into Redshift
- Set up AWS infrastructure using Infrastructure as Code (IaC)
- Design an optimized table by selecting the appropriate distribution style and sorting key
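The sketch below puts those last two ideas together: a Redshift table declared with a distribution key and sort key, then loaded from S3 with the COPY command. The cluster endpoint, credentials, IAM role ARN, and bucket path are all placeholders.

```python
# Sketch: a Redshift table with DISTKEY/SORTKEY, loaded from S3 with COPY.
# Endpoint, credentials, IAM role, and bucket path are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=example-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "dbname=dev user=awsuser password=example port=5439"
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1),
        start_time  TIMESTAMP SORTKEY,   -- rows stored sorted by time
        user_id     INT,
        song_id     VARCHAR DISTKEY,     -- rows distributed across slices by song
        session_id  INT
    );
""")

cur.execute("""
    COPY songplays
    FROM 's3://example-bucket/songplays/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/exampleRedshiftRole'
    FORMAT AS JSON 'auto';
""")

conn.commit()
cur.close()
conn.close()
```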
Project: Build a Cloud Data Warehouse
In this project, you are tasked with creating an ELT pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so that the analytics team can continue to discover insights into which songs users are listening to.
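In an ELT flow like this, the raw data is first copied into staging tables and then transformed inside Redshift with SQL. The sketch below shows one such transform step as an INSERT ... SELECT from a hypothetical staging table into a dimension table; all names are illustrative.

```python
# Sketch: the "transform" step of an ELT pipeline, executed inside Redshift.
# Assumes a staging_events table was already loaded with COPY; names are illustrative.
import psycopg2

conn = psycopg2.connect(
    "host=example-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "dbname=dev user=awsuser password=example port=5439"
)
cur = conn.cursor()

cur.execute("""
    INSERT INTO dim_users (user_id, first_name, last_name, level)
    SELECT DISTINCT userid, firstname, lastname, level
    FROM staging_events
    WHERE userid IS NOT NULL;
""")

conn.commit()
cur.close()
conn.close()
```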
Course 3: Spark and Data Lakes
The Power of Spark
- Understand the big data ecosystem
- Understand when to use Spark and when not to use it
Data Wrangling with Spark
- Manipulate data with SparkSQL and Spark DataFrames
- Use Spark for ETL purposes
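Here is a brief PySpark sketch of that kind of wrangling, expressing one aggregation with the DataFrame API and again with SparkSQL over a temporary view; the input path and column names are placeholders.

```python
# Sketch: data wrangling with Spark DataFrames and SparkSQL.
# The input path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()

logs = spark.read.json("data/log_data/*.json")

# DataFrame API: filter, group, and count.
plays_per_user = (
    logs.filter(col("page") == "NextSong")
        .groupBy("userId")
        .count()
)

# SparkSQL: the same question expressed as SQL over a temporary view.
logs.createOrReplaceTempView("log_events")
plays_sql = spark.sql("""
    SELECT userId, COUNT(*) AS plays
    FROM log_events
    WHERE page = 'NextSong'
    GROUP BY userId
""")

plays_per_user.show(5)
plays_sql.show(5)

spark.stop()
```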
Debugging and Optimization
- Troubleshoot common errors and optimize your code using the Spark Web UI
Introduction to Data Lakes
- Understand the purpose and evolution of data lakes
- Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
- Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages
- Understand the components and issues of data lakes
Project: Build a Data Lake
You will create an ETL pipeline for a data lake in this project. The data is stored in S3, in a directory of JSON logs on app user activity and a directory of JSON metadata on the songs in the app.
You will load data from S3, process it into analytics tables with Spark, and then load it back into S3. You will use AWS to deploy this Spark process on a cluster.
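A condensed sketch of such a pipeline is shown below: read JSON logs from S3 with Spark, derive an analytics table, and write it back to S3 as partitioned Parquet. The bucket paths and column names are placeholders, and the cluster is assumed to be configured for S3 access.

```python
# Sketch: data lake ETL with Spark, reading JSON from S3, transforming it,
# and writing an analytics table back to S3 as partitioned Parquet.
# Bucket paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

logs = spark.read.json("s3a://example-bucket/log_data/*.json")

songplays = (
    logs.filter(col("page") == "NextSong")
        .withColumn("start_time", (col("ts") / 1000).cast("timestamp"))
        .select("start_time", "userId", "song", "artist", "sessionId")
        .withColumn("year", year("start_time"))
        .withColumn("month", month("start_time"))
)

(songplays.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://example-bucket/analytics/songplays/"))

spark.stop()
```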
Course 4: Automate Data Pipelines
Data Pipelines
- Create data pipelines with Apache Airflow
- Set up task dependencies
- Create data connections using hooks
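The sketch below pulls those pieces together in a small DAG: two tasks, an explicit dependency, and a PostgresHook for the database connection. It assumes Airflow 2.x and a connection named example_postgres defined in Airflow; the table it queries is a placeholder.

```python
# Sketch: a minimal Airflow DAG with a task dependency and a Postgres hook.
# Assumes Airflow 2.x and an "example_postgres" connection; the table is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract():
    print("pretend to pull data from an API or S3 here")


def count_rows():
    hook = PostgresHook(postgres_conn_id="example_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM songs")
    print(f"songs table currently holds {records[0][0]} rows")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    count_task = PythonOperator(task_id="count_rows", python_callable=count_rows)

    extract_task >> count_task  # count_rows runs only after extract succeeds
```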
Data Quality
- Track data lineage
- Set up data pipeline schedules
- Partition data to optimize pipelines
- Write tests to ensure data quality
- Backfill data
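For example, the sketch below adds a simple row-count quality check and uses a past start_date with catchup=True so Airflow can backfill the missed daily runs; it assumes Airflow 2.x and the same hypothetical example_postgres connection as above.

```python
# Sketch: a data quality check task on a schedule that supports backfills.
# Assumes Airflow 2.x and an "example_postgres" connection; the table is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_has_rows():
    hook = PostgresHook(postgres_conn_id="example_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM songs")
    if not records or records[0][0] < 1:
        raise ValueError("Data quality check failed: songs table is empty")


with DAG(
    dag_id="quality_checked_pipeline",
    start_date=datetime(2024, 1, 1),  # a start date in the past ...
    schedule_interval="@daily",
    catchup=True,                     # ... plus catchup=True lets Airflow backfill old runs
) as dag:
    PythonOperator(task_id="check_has_rows", python_callable=check_has_rows)
```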
Production Data Pipelines
- Build reusable and maintainable pipelines
- Build your own Apache Airflow plugins (see the sketch after this list)
- Implement subDAGs
- Set up task boundaries
- Monitor data pipelines
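The sketch below shows the row-count check from the previous example packaged as a reusable custom operator, the kind of component you might ship as part of an Airflow plugin; it assumes Airflow 2.x, and the connection id and table are placeholders.

```python
# Sketch: a reusable custom operator for data quality checks.
# Assumes Airflow 2.x; connection id and table name are placeholders.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class HasRowsOperator(BaseOperator):
    """Fail the task if the given table has no rows."""

    def __init__(self, table, postgres_conn_id="example_postgres", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info("Table %s passed with %s rows", self.table, records[0][0])
```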
Project 1: Data Pipelines with Airflow
In this project, you will continue your work on the data infrastructure of a music streaming company by creating and automating a set of data pipelines. You'll use Airflow to configure and schedule data pipelines, as well as monitor and debug production pipelines.
Project 2: Data Engineering Capstone
The goal of the data engineering capstone project is to let you apply what you've learned throughout the program.
This project will be an important part of your portfolio, helping you achieve your data engineering career goals.
In this project, you will define the project's scope and the data you will work with. We will provide guidelines, suggestions, tips, and resources to help you succeed, but your project will be unique to you. You will gather data from various sources; transform, combine, and summarize it; and create a clean database for others to analyze.