Description
The design and analysis of statistical experiments are at the heart of data science. In this course you will design statistical experiments and analyze the results using modern methods. You will also look at common pitfalls in interpreting statistical arguments, especially those involving big data. This course will help you internalize a core set of practical and effective machine learning methods and concepts, which you will then apply to solve real-world problems.
By the end of this course, you will be able to:
- Design effective experiments and analyze the results.
- Employ resampling methods to make clear, bulletproof statistical arguments without resorting to esoteric notation.
- Explain and apply a core set of classification methods (rules, trees, random forests) and associated optimization methods of increasing complexity (gradient descent and variants).
- Explain and apply a set of concepts and methods for unsupervised learning.
- Describe the most common idioms used in large-scale graph analytics, such as structural queries, traversals and recursive queries, PageRank, and community detection (see the PageRank sketch after this list).
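To make the graph-analytics objective concrete, here is a minimal sketch of PageRank computed by power iteration on a toy directed graph. The graph, damping factor, and convergence tolerance are illustrative assumptions, not course materials.

```python
import numpy as np

# Toy directed graph as an adjacency list (an illustrative assumption):
# node -> list of nodes it links to.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(graph)

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in graph.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

d = 0.85                    # damping factor (the conventional default)
rank = np.full(n, 1.0 / n)  # start from the uniform distribution

# Power iteration: r <- d*M r + (1-d)/n, repeated until the ranks stabilize.
for _ in range(100):
    new_rank = d * (M @ rank) + (1.0 - d) / n
    converged = np.abs(new_rank - rank).sum() < 1e-9
    rank = new_rank
    if converged:
        break

print(rank)  # sums to 1; node 2, with three in-links, scores highest
```

Damping makes the random walk ergodic, so power iteration converges to a unique stationary distribution regardless of the starting vector.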
Syllabus:
1. Practical Statistical Inference
- Appetite Whetting: Bad Science
- Hypothesis Testing
- Significance Tests and P-Values
- Deriving the Sampling Distribution
- Shuffle Test for Significance (sketched in code after this outline)
- Comparing Classical and Resampling Methods
- Bootstrap
- Resampling Caveats
- Outliers and Rank Transformation
- Bad Science Revisited: Publication Bias
- Effect Size
- Meta-analysis
- Fraud and Benford's Law
- Intuition for Benford's Law
- Benford's Law Explained Visually
- Multiple Hypothesis Testing: Bonferroni and Sidak Corrections
- Multiple Hypothesis Testing: False Discovery Rate
- Multiple Hypothesis Testing: Benjamini-Hochberg Procedure
- Big Data and Spurious Correlations
- Spurious Correlations: Stock Price Example
- How is Big Data Different?
- Bayesian vs. Frequentist
- Motivation for Bayesian Approaches
- Bayes' Theorem
- Applying Bayes' Theorem
- Naive Bayes: Spam Filtering
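As a first taste of the resampling methods in this module, here is a minimal sketch of the shuffle (permutation) test for significance on synthetic two-group data. The group sizes, effect size, and shuffle count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an illustrative assumption): does group B really
# have a higher mean than group A, or could it be chance?
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=11.0, scale=2.0, size=30)
observed = b.mean() - a.mean()

# Shuffle test: under the null hypothesis the group labels are
# exchangeable, so pool the data, reshuffle the labels many times,
# and count how often a difference at least this large arises.
pooled = np.concatenate([a, b])
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = pooled[len(a):].mean() - pooled[:len(a)].mean()
    if diff >= observed:
        count += 1

p_value = count / n_shuffles
print(f"observed difference {observed:.2f}, one-sided p = {p_value:.4f}")
```

The bootstrap from this module inverts the recipe: instead of shuffling labels under the null, resample each group with replacement to estimate a confidence interval for the observed difference.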
2. Supervised Learning
- Statistics vs. Machine Learning
- Simple Examples
- Structure of a Machine Learning Problem
- Classification with Simple Rules
- Learning Rules
- Rules: Sequential Covering
- Rules Recap
- From Rules to Trees
- Entropy
- Measuring Entropy
- Using Information Gain to Build Trees (see the sketch after this outline)
- Building Trees: ID3 Algorithm
- Building Trees: C4.5 Algorithm
- Rules and Trees Recap
- Overfitting
- Evaluation: Leave-One-Out Cross-Validation
- Accuracy and ROC Curves
- Bootstrap Revisited
- Ensembles, Bagging, Boosting
- Boosting Walkthrough
- Random Forests
- Random Forests: Variable Importance
- Summary: Trees and Forests
- Nearest Neighbor
- Similarity Functions
- Curse of Dimensionality
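To ground the tree-building topics above, here is a minimal sketch of entropy and information gain, the split criterion behind ID3 and C4.5. The tiny play-tennis-style dataset is an illustrative assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the dataset on attribute attr."""
    gain = entropy(labels)
    n = len(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy dataset (an illustrative assumption): should we play tennis?
rows = [
    {"outlook": "sunny",    "windy": False},
    {"outlook": "sunny",    "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "rainy",    "windy": False},
    {"outlook": "rainy",    "windy": True},
]
labels = ["no", "no", "yes", "yes", "no"]

for attr in ("outlook", "windy"):
    print(attr, round(information_gain(rows, labels, attr), 3))
```

ID3 applies this greedily: split on the highest-gain attribute (here `outlook`), then recurse on each subset; C4.5 refines the criterion with gain ratio and adds handling for continuous attributes and pruning.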
3. Optimization
- Optimization by Gradient Descent (see the sketch after this outline)
- Gradient Descent Visually
- Gradient Descent in Detail
- Gradient Descent: Questions to Consider
- Intuition for Logistic Regression
- Intuition for Support Vector Machines
- Support Vector Machine Example
- Intuition for Regularization
- Intuition for LASSO and Ridge Regression
- Stochastic and Batched Gradient Descent
- Parallelizing Gradient Descent
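To make the module's central method concrete, here is a minimal sketch contrasting batch and stochastic gradient descent on a one-variable least-squares fit. The synthetic data, learning rate, and iteration counts are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an illustrative assumption): y = 3x + 1 plus noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=200)

lr = 0.1  # learning rate (an assumed hyperparameter)

# Batch gradient descent: every step uses the gradient of the mean
# squared error over ALL examples.
w, b = 0.0, 0.0
for _ in range(200):
    err = (w * X + b) - y
    w -= lr * 2.0 * (err * X).mean()
    b -= lr * 2.0 * err.mean()
print(f"batch GD:      w={w:.3f}, b={b:.3f}")

# Stochastic gradient descent: every step uses ONE random example,
# trading noisier updates for far cheaper steps on large datasets.
w, b = 0.0, 0.0
for _ in range(5):  # epochs
    for i in rng.permutation(len(X)):
        err = (w * X[i] + b) - y[i]
        w -= lr * 2.0 * err * X[i]
        b -= lr * 2.0 * err
print(f"stochastic GD: w={w:.3f}, b={b:.3f}")
```

Mini-batching sits between the two extremes: each step averages the gradient over a small random subset, which also parallelizes naturally across workers.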
4. Unsupervised Learning
- Introduction to Unsupervised Learning
- K-means (see the sketch after this outline)
- DBSCAN
- DBSCAN: Variable Density and Parallel Algorithms
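Finally, here is a minimal sketch of k-means via Lloyd's algorithm on two synthetic 2-D blobs. The data, choice of k, and iteration cap are illustrative assumptions (and the sketch ignores the empty-cluster edge case).

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated synthetic blobs (an illustrative assumption).
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.5, size=(50, 2)),
])

k = 2
centers = points[rng.choice(len(points), size=k, replace=False)]

# Lloyd's algorithm: alternate assignment and update until stable.
for _ in range(100):
    # Assignment step: each point joins its nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points.
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)  # should land near (0, 0) and (3, 3)
```

DBSCAN, also covered in this module, avoids fixing k in advance: it grows clusters from density-reachable points, which lets it find non-spherical clusters and label sparse points as noise.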