Description
In this course, you will learn:
- How to use R and the tidyverse to identify and address many of the data integrity issues that modern data scientists face.
- How to deal with missing values and duplicated data.
- How to convert data between units and how to deal with badly formatted text.
- How to spot outliers, deal with structural issues, and recognize red flags that indicate potential data quality issues.
Syllabus:
1. Missing Data
- Types of missing data
- Missing values
- Missing rows
- Aggregations and missing values
2. Duplicated Data
- Duplicated rows and values
- Aggregations in the data set
3. Formatting Data
- Converting dates
- Unit conversions
- Numbers stored as text
- Text improperly converted to numbers
- Inconsistent spellings
4. Outliers
- Screening for outliers
- Handling outliers
- Outliers use case
- Outliers in subgroups
- Detecting illogical values
5. Tidy Data
- What is tidy data?
- Variables, observations, and values
- Common data problems
- Wide vs. long data sets
- Making wide data sets long
- Making long data sets wide
6. Red Flags
- Suspicious values
- Suspicious multiples