Description
In this course, you will build classifiers that perform well on a variety of tasks. You will become acquainted with the most successful and widely used techniques in practice, such as logistic regression, decision trees, and boosting. You will also learn to design and implement the underlying algorithms for learning these models at scale using stochastic gradient ascent, and you will apply these techniques to real-world, large-scale machine learning tasks. Along the way, the course covers important issues that arise in real-world ML applications, such as dealing with missing data and measuring precision and recall to evaluate a classifier. The course is hands-on and action-packed, with visualizations and illustrations of how these techniques perform on real-world data. We've also included optional content in each module that covers advanced topics for those who want to dig even deeper!
Syllabus:
1. Welcome!
- Welcome to the classification course, a part of the Machine Learning Specialization
- What is this course about?
- Impact of classification
- Course overview
- Outline of first half of course
- Outline of second half of course
- Assumed background
- Let's get started!
2. Linear Classifiers & Logistic Regression
- Linear classifiers: A motivating example
- Intuition behind linear classifiers
- Decision boundaries
- Linear classifier model
- Effect of coefficient values on decision boundary
- Using features of the inputs
- Predicting class probabilities
- Review of basics of probabilities
- Review of basics of conditional probabilities
- Using probabilities in classification
- Predicting class probabilities with (generalized) linear models
- The sigmoid (or logistic) link function
- Logistic regression model
- Effect of coefficient values on predicted probabilities
- Overview of learning logistic regression models
- Encoding categorical inputs
- Multiclass classification with 1 versus all
- Recap of logistic regression classifier
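
As a concrete companion to the module above, here is a minimal sketch of how a logistic regression classifier turns a score into a class probability via the sigmoid (logistic) link function. The feature matrix, coefficient values, and function names are illustrative, not taken from the course materials:

```python
import numpy as np

def sigmoid(score):
    """Logistic (sigmoid) link function: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

def predict_probability(features, coefficients):
    """P(y = +1 | x, w) = sigmoid(w . h(x)) for each row of `features`."""
    scores = features.dot(coefficients)
    return sigmoid(scores)

# Hypothetical two-feature example; the column of 1s is the intercept term.
X = np.array([[1.0,  2.0, 0.5],
              [1.0, -1.0, 3.0]])
w = np.array([0.1, 1.5, -0.5])   # illustrative coefficient values
print(predict_probability(X, w))
```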
3. (A) Learning Linear Classifiers
- Goal: Learning parameters of logistic regression
- Intuition behind maximum likelihood estimation
- Data likelihood
- Finding best linear classifier with gradient ascent
- Review of gradient ascent
- Learning algorithm for logistic regression
- Example of computing derivative for logistic regression
- Interpreting derivative for logistic regression
- Summary of gradient ascent for logistic regression
- Choosing step size
- Careful with step sizes that are too large
- Rule of thumb for choosing step size
- (VERY OPTIONAL) Deriving gradient of logistic regression: Log trick
- (VERY OPTIONAL) Expressing the log-likelihood
- (VERY OPTIONAL) Deriving probability y=-1 given x
- (VERY OPTIONAL) Rewriting the log likelihood into a simpler form
- (VERY OPTIONAL) Deriving gradient of log likelihood
- Recap of learning logistic regression classifiers
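
The learning algorithm outlined above fits in a few lines. This is a simplified batch gradient ascent sketch, assuming +1/-1 labels and a fixed step size; the function name and loop structure are our own:

```python
import numpy as np

def logistic_regression(features, labels, step_size, max_iter):
    """Maximize the data log-likelihood with batch gradient ascent.

    With +1/-1 labels, the derivative of the log-likelihood for each
    coefficient is sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w)).
    """
    coefficients = np.zeros(features.shape[1])
    indicator = (labels == +1)
    for _ in range(max_iter):
        predictions = 1.0 / (1.0 + np.exp(-features.dot(coefficients)))
        errors = indicator - predictions
        gradient = features.T.dot(errors)
        coefficients += step_size * gradient   # ascent: step *up* the gradient
    return coefficients
```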
(B) Overfitting & Regularization in Logistic Regression
- Evaluating a classifier
- Review of overfitting in regression
- Overfitting in classification
- Visualizing overfitting with high-degree polynomial features
- Overfitting in classifiers leads to overconfident predictions
- Visualizing overconfident predictions
- (OPTIONAL) Another perspective on overfitting in logistic regression
- Penalizing large coefficients to mitigate overfitting
- L2 regularized logistic regression
- Visualizing effect of L2 regularization in logistic regression
- Learning L2 regularized logistic regression with gradient ascent
- Sparse logistic regression with L1 regularization
- Recap of overfitting & regularization in logistic regression
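
For L2 regularized logistic regression, the only change to the gradient ascent update is a penalty term in each derivative. A minimal sketch, following the common convention (used in this course's treatment) of not penalizing the intercept; the helper name is hypothetical:

```python
def feature_derivative_l2(errors, feature_column, coefficient, l2_penalty, is_intercept):
    """Derivative of the L2-penalized log-likelihood for one coefficient.

    Identical to the unregularized derivative except for the
    -2 * l2_penalty * coefficient term, which shrinks large coefficients.
    """
    derivative = feature_column.dot(errors)
    if not is_intercept:
        derivative -= 2.0 * l2_penalty * coefficient
    return derivative
```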
4. Decision Trees
- Predicting loan defaults with decision trees
- Intuition behind decision trees
- Task of learning decision trees from data
- Recursive greedy algorithm
- Learning a decision stump
- Selecting best feature to split on
- When to stop recursing
- Making predictions with decision trees
- Multiclass classification with decision trees
- Threshold splits for continuous inputs
- (OPTIONAL) Picking the best threshold to split on
- Visualizing decision boundaries
- Recap of decision trees
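
To make the recursive greedy algorithm concrete, here is a sketch of its inner step: selecting the feature whose split yields the lowest weighted classification error. It assumes binary (0/1) features and +1/-1 labels; the names and structure are illustrative:

```python
import numpy as np

def node_error(labels):
    """Classification error if we predict the majority class at this node."""
    if len(labels) == 0:
        return 0.0
    majority = max(np.sum(labels == +1), np.sum(labels == -1))
    return 1.0 - majority / len(labels)

def best_split(X, y):
    """Greedy step: pick the binary feature whose split minimizes weighted error."""
    best_feature, best_error = None, np.inf
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        error = (len(left) * node_error(left) + len(right) * node_error(right)) / len(y)
        if error < best_error:
            best_feature, best_error = j, error
    return best_feature, best_error
```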
5. (A) Preventing Overfitting in Decision Trees
- A review of overfitting
- Overfitting in decision trees
- Principle of Occam's razor: Learning simpler decision trees
- Early stopping in learning decision trees
- (OPTIONAL) Motivating pruning
- (OPTIONAL) Pruning decision trees to avoid overfitting
- (OPTIONAL) Tree pruning algorithm
- Recap of overfitting and regularization in decision trees
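
A sketch of the early-stopping checks discussed above, with illustrative threshold values; in practice these would be tuned on a validation set:

```python
def should_stop(labels, depth, best_error_reduction,
                max_depth=10, min_node_size=10, min_error_reduction=0.0):
    """Early-stopping conditions for growing a decision tree (thresholds illustrative)."""
    if len(set(labels)) <= 1:        # node is pure: nothing left to split
        return True
    if depth >= max_depth:           # condition 1: maximum depth reached
        return True
    if len(labels) <= min_node_size: # condition 2: too few data points at this node
        return True
    if best_error_reduction <= min_error_reduction:
        return True                  # condition 3: best split barely reduces error
    return False
```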
(B) Handling Missing Data
- Challenge of missing data
- Strategy 1: Purification by skipping missing data
- Strategy 2: Purification by imputing missing data
- Modifying decision trees to handle missing data
- Feature split selection with missing data
- Recap of handling missing data
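
The two purification strategies can be illustrated in a couple of lines of pandas; the tiny DataFrame and the choice of median imputation are purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50.0, np.nan, 80.0], "default": [0, 1, 0]})

# Strategy 1: purification by skipping rows that contain missing values.
skipped = df.dropna()

# Strategy 2: purification by imputing missing values (here, the column median).
imputed = df.fillna({"income": df["income"].median()})
```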
6. Boosting
- The boosting question
- Ensemble classifiers
- Boosting
- AdaBoost overview
- Weighted error
- Computing coefficient of each ensemble component
- Reweighting data to focus on mistakes
- Normalizing weights
- Example of AdaBoost in action
- Learning boosted decision stumps with AdaBoost
- The Boosting Theorem
- Overfitting in boosting
- Ensemble methods, impact of boosting & quick recap
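
The core AdaBoost bookkeeping from this module (weighted error, component coefficient, reweighting, normalization) fits in one short function. This sketch assumes +1/-1 labels and a weighted error strictly between 0 and 1; the function name is our own:

```python
import numpy as np

def adaboost_update(alpha, y, y_hat):
    """One AdaBoost round given data weights alpha and predictions y_hat.

    Returns the coefficient w_t of the new ensemble component and the
    renormalized data weights, increased on mistakes to focus on them.
    """
    weighted_error = alpha[y != y_hat].sum() / alpha.sum()
    w_t = 0.5 * np.log((1.0 - weighted_error) / weighted_error)
    alpha = np.where(y == y_hat, alpha * np.exp(-w_t), alpha * np.exp(w_t))
    return w_t, alpha / alpha.sum()   # normalize so the weights sum to 1
```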
7. Precision-Recall
- Case-study where accuracy is not best metric for classification
- What is good performance for a classifier?
- Precision: Fraction of positive predictions that are actually positive
- Recall: Fraction of positive data predicted to be positive
- Precision-recall extremes
- Trading off precision and recall
- Precision-recall curve
- Recap of precision-recall
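
Precision and recall, as defined above, reduce to counting true positives, false positives, and false negatives. A minimal sketch assuming +1/-1 labels:

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```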
8. Scaling to Huge Datasets & Online Learning
- Gradient ascent won't scale to today's huge datasets
- Timeline of scalable machine learning & stochastic gradient
- Why gradient ascent won't scale
- Stochastic gradient: Learning one data point at a time
- Comparing gradient to stochastic gradient
- Why would stochastic gradient ever work?
- Convergence paths
- Shuffle data before running stochastic gradient
- Choosing step size
- Don't trust last coefficients
- (OPTIONAL) Learning from batches of data
- (OPTIONAL) Measuring convergence
- (OPTIONAL) Adding regularization
- The online learning task
- Using stochastic gradient for online learning
- Scaling to huge datasets through parallelization & module recap
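
Finally, a sketch of stochastic gradient ascent for logistic regression, updating on one shuffled data point at a time; the step size, number of passes, and seed are illustrative:

```python
import numpy as np

def stochastic_gradient_ascent(features, labels, step_size, num_passes, seed=0):
    """SGD for logistic regression: one update per data point per pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(features.shape[1])
    n = len(labels)
    for _ in range(num_passes):
        order = rng.permutation(n)   # shuffle data before each pass
        for i in order:
            p = 1.0 / (1.0 + np.exp(-features[i].dot(w)))
            w += step_size * features[i] * ((labels[i] == +1) - p)
    return w
```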