Description
In this course, you will:
- learn a systematic approach to the data understanding phase of predictive modeling
- use principles, guidelines, and tools such as KNIME and R to properly assess a data set's suitability for machine learning
- learn how to collect data, describe data, explore data using bivariate visualizations, and verify data quality before moving on to the data preparation phase
- work through case studies, best practices, and challenge-and-solution sets for improved knowledge retention
By the end, you should have acquired the knowledge and skills required to give proper attention to this critical phase of all successful data science projects.
Syllabus:
1. What Is Data Assessment?
- Clarifying how data understanding differs from data visualization
- Introducing the critical data understanding phase of CRISP-DM
- Data assessment in CRISP-DM alternatives: The IBM ASUM-DM and Microsoft TDSP
- Navigating the transition from business understanding to data understanding
- How to organize your work with the four data understanding tasks
2. Collect Initial Data
- Considerations in gathering the relevant data
- A strategy for processing data sources
- Getting creative about data sources
- How to envision a proper flat file
- Anticipating data integration
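As a rough illustration of the flat file and data integration ideas above, here is a minimal R sketch (the file names and columns are hypothetical) that reads two sources and merges them into a single flat file with one row per customer:

```r
# Hypothetical sources, each keyed by customer_id
customers <- read.csv("customers.csv")   # e.g. customer_id, age, region
purchases <- read.csv("purchases.csv")   # e.g. customer_id, total_spend

# Merge on the shared key to produce a single flat file for modeling
flat <- merge(customers, purchases, by = "customer_id", all.x = TRUE)

write.csv(flat, "flat_file.csv", row.names = FALSE)
```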
3. First Look at the Data
- Reviewing basic concepts in the level of measurement
- What is dummy coding? (sketched in R after this section)
- Expanding our definition of level of measurement
- Taking an initial look at possible key variables
- Dealing with duplicate IDs and transactional data
- How many potential variables (columns) will I have?
- How to deal with high-order multiple nominals
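As a quick, hypothetical illustration of dummy coding, R's model.matrix() expands a factor into 0/1 indicator columns; a nominal variable with many levels multiplies the column count accordingly, which is the concern behind high-order nominals:

```r
# Hypothetical data frame with one nominal (categorical) variable
df <- data.frame(region = factor(c("North", "South", "East", "North")))

# Dummy coding: each level (minus a reference level) becomes a 0/1 column
dummies <- model.matrix(~ region, data = df)
dummies

# A nominal with k levels yields k - 1 dummy columns, so high-order
# nominals can inflate the number of candidate predictors quickly
nlevels(df$region) - 1
```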
4. Data Loading and Unit of Analysis
- Introducing the KNIME Analytics Platform
- Tips and tricks to consider during data loading
- Unit of analysis decisions
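The unit of analysis decision can be sketched in R with hypothetical data: transactional rows often need to be rolled up so that each row matches the unit the model will score, such as one row per customer:

```r
# Hypothetical transaction-level data: one row per purchase
tx <- data.frame(customer_id = c(1, 1, 2, 3, 3, 3),
                 amount      = c(20, 35, 10, 5, 15, 25))

# Roll up to one row per customer -- the unit of analysis for modeling
customers <- aggregate(amount ~ customer_id, data = tx, FUN = sum)
customers
```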
5. Describe Data
- How to uncover the gross properties of the data
- Researching the dataset
- Tips and tricks using simple aggregation commands
- A simple strategy for organizing your work
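A few of the simple commands this task relies on, sketched in base R (assuming a data frame named df has already been loaded; the gender column is hypothetical):

```r
# Gross properties of the data: size, types, and simple distributions
dim(df)        # number of rows and columns
str(df)        # variable names, types, and example values
summary(df)    # five-number summaries and NA counts per column

# Frequency counts for a nominal variable, including missing values
table(df$gender, useNA = "ifany")
```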
6. Data Description Case Studies
- Describe data demo using the UCI heart dataset
7. Explore Data Basics
- The explore data task
- How to be effective at univariate analysis and data visualization
- Anscombe's quartet (sketched in R after this section)
- The Data Explorer node in KNIME
- How to navigate borderline cases of variable type
- How to be effective at bivariate data visualization
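Anscombe's quartet ships with R and makes the case for bivariate visualization: four x/y pairs with nearly identical summary statistics but very different shapes. A minimal sketch:

```r
data(anscombe)

# The four pairs share nearly identical correlations (~0.816)
sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# But the scatterplots tell four very different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4)
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
par(op)
```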
8. Explore Data Tips and Tricks
- How to utilize an SME's time effectively
- Techniques for working with the top predictors
- Advice for weak predictors
- Tips and tricks when searching for quirks in your data
- Learning when to discard rows
- Introducing ggplot2
- Orienting to R's ggplot2 for powerful multivariate data visualizations
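A minimal ggplot2 sketch of the kind of multivariate view this section builds toward, using R's built-in mtcars data purely as a stand-in: two numeric variables plus a third, categorical variable mapped to color:

```r
library(ggplot2)

# Scatterplot of two numeric variables, with a nominal variable
# encoded as color -- a simple multivariate visualization
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```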
9. Verify Data Quality
- Exploring your missing data options
- Why you lose rows to listwise deletion
- Investigating the provenance of the missing data
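A minimal sketch with hypothetical data of why listwise deletion is costly: a single missing value in any column removes the entire row:

```r
df <- data.frame(age    = c(34, NA, 52, 41),
                 income = c(55000, 61000, NA, 72000))

nrow(df)                  # 4 rows before
complete <- na.omit(df)   # listwise deletion: keep only fully observed rows
nrow(complete)            # 2 rows remain -- half the data is gone
```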
10. Missing Data Case Study
- Introducing the KDD Cup 1998 data
- What is the pattern of missing data in your data?
- Is the missing data worth saving?
- Assessing imputation as a potential solution
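As a baseline for the imputation discussion, a minimal R sketch (hypothetical column) of simple mean imputation; the case study weighs this kind of quick fix against discarding the variable or using more principled methods:

```r
df <- data.frame(income = c(55000, NA, 61000, NA, 72000))

sum(is.na(df$income))     # how much is missing?

# Replace missing values with the mean of the observed values
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)
df$income
```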
11. Explore and Verify Case Studies
- Exploring and verifying data quality with the UCI heart dataset
12. Making the Transition to Data Preparation
- Why formal reports are important
- Creating a data prep to-do list
- How to prepare for eventual deployment