Case Studies: Finding Documents That Are Similar
You want to recommend similar articles to a reader who is interested in a specific news article. What is the correct definition of similarity? What if there are millions of additional documents? Do you have to search through all other documents every time you want to retrieve a new document? How do you group documents that are similar? How do you find out about new and emerging topics covered in the documents?
In this third case study, finding similar documents, you will investigate similarity-based retrieval algorithms. The course also covers structured representations for describing documents in a corpus, including clustering and mixed membership models such as latent Dirichlet allocation (LDA). You will use expectation maximisation (EM) to learn document clusterings and scale the methods using MapReduce.
Learning Objectives: By the end of this course, you should be able to:
- Create a document retrieval system based on k-nearest neighbours.
- Compare various similarity metrics for text data.
- Use KD-trees to reduce computations in k-nearest neighbour search.
- Use locality sensitive hashing to compute approximate nearest neighbours.
- Consider the differences between supervised and unsupervised learning tasks.
- Use k-means to group documents by topic.
- Explain how to use MapReduce to parallelize k-means.
- Consider probabilistic clustering approaches that make use of mixture models.
- Use expectation maximisation (EM) to fit a Gaussian mixture model.
- Use latent Dirichlet allocation (LDA) to perform mixed membership modelling.
- Explain the steps of a Gibbs sampler and how to use the output to draw conclusions.
- Compare and contrast non-convex optimization initialization techniques.
- Implement these techniques in Python.
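As a taste of the first objective, a k-nearest-neighbour document retrieval system over raw term counts might look like the following minimal sketch. The helper names (`tf_vector`, `cosine_distance`, `knn_retrieve`) are illustrative, not course code:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector as a sparse word -> count mapping."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 minus cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def knn_retrieve(query, corpus, k=2):
    """Brute force: rank every document by distance to the query,
    return the indices of the k nearest."""
    q = tf_vector(query)
    dists = [(cosine_distance(q, tf_vector(d)), i) for i, d in enumerate(corpus)]
    dists.sort()
    return [i for _, i in dists[:k]]
```

This brute-force search scans every document per query; the KD-tree and LSH material in Module 2 exists precisely to avoid that linear scan.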
1. Welcome
- Introduction to clustering and retrieval tasks
- Course overview
- Module-by-module topics covered
- Assumed background
2. Nearest Neighbor Search
- Retrieval as k-nearest neighbor search
- 1-NN algorithm
- k-NN algorithm
- Document representation
- Distance metrics: Euclidean and scaled Euclidean
- Writing (scaled) Euclidean distance using (weighted) inner products
- Distance metrics: Cosine similarity
- To normalize or not and other distance considerations
- Complexity of brute force search
- KD-tree representation
- NN search with KD-trees
- Complexity of NN search with KD-trees
- Visualizing scaling behavior of KD-trees
- Approximate k-NN search using KD-trees
- Limitations of KD-trees
- LSH as an alternative to KD-trees
- Using random lines to partition points
- Defining more bins
- Searching neighboring bins
- LSH in higher dimensions
- (OPTIONAL) Improving efficiency through multiple tables
- A brief recap
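The random-line partitioning behind LSH can be sketched as follows: each random line (a random hyperplane in higher dimensions) contributes one bit to a bin key, and points that land in the same bin are candidate neighbours. `lsh_bins` and `query_bin` are hypothetical helper names, and a real system would also search neighbouring bins as the module describes:

```python
import random
from collections import defaultdict

def lsh_bins(vectors, num_lines, seed=0):
    """Hash each vector to a bit-string bin: one bit per random line,
    set by which side of the line the vector falls on."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    lines = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_lines)]
    bins = defaultdict(list)
    for idx, v in enumerate(vectors):
        key = tuple(sum(l[j] * v[j] for j in range(dim)) >= 0 for l in lines)
        bins[key].append(idx)
    return bins, lines

def query_bin(q, lines):
    """Compute the bin key a query point hashes to."""
    return tuple(sum(l[j] * q[j] for j in range(len(q))) >= 0 for l in lines)
```

More lines mean more, smaller bins (cheaper search within a bin, higher risk of splitting true neighbours apart), which is the trade-off the "Defining more bins" and "Searching neighboring bins" topics explore.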
3. Clustering with k-means
- The goal of clustering
- An unsupervised task
- Hope for unsupervised learning, and some challenge cases
- The k-means algorithm
- k-means as coordinate descent
- Smart initialization via k-means++
- Assessing the quality and choosing the number of clusters
- Motivating MapReduce
- The general MapReduce abstraction
- MapReduce execution overview and combiners
- MapReduce for k-means
- Other applications of clustering
- A brief recap
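The k-means coordinate-descent loop from this module reduces to two alternating steps; here is a minimal plain-Python sketch with random initialization rather than k-means++ (function and variable names are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters
```

The assignment step is embarrassingly parallel over points and the update step is a per-cluster sum, which is why the algorithm maps so naturally onto MapReduce.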
4. Mixture Models
- Motivating probabilistic clustering models
- Aggregating over unknown classes in an image dataset
- Univariate Gaussian distributions
- Bivariate and multivariate Gaussians
- Mixture of Gaussians
- Interpreting the mixture of Gaussian terms
- Scaling mixtures of Gaussians for document clustering
- Computing soft assignments from known cluster parameters
- (OPTIONAL) Responsibilities as Bayes' rule
- Estimating cluster parameters from known cluster assignments
- Estimating cluster parameters from soft assignments
- EM iterates in equations and pictures
- Convergence, initialization, and overfitting of EM
- Relationship to k-means
- A brief recap
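The E- and M-steps above can be illustrated on a one-dimensional, two-component Gaussian mixture. This is a toy sketch with ad-hoc initialization (means at the data extremes), not the course's implementation:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = [min(data), max(data)]   # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft assignments (responsibilities) via Bayes' rule.
        resp = []
        for x in data:
            lik = [pi[k] / math.sqrt(2 * math.pi * var[k])
                   * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(lik)
            resp.append([l / s for l in lik])
        # M-step: re-estimate weights, means, variances from soft counts.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return pi, mu, var
```

If the responsibilities are hardened to 0/1 and the variances are fixed and shared, the loop collapses to k-means, which is the "Relationship to k-means" point above.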
5. Mixed Membership Modeling via Latent Dirichlet Allocation
- Mixed membership models for documents
- An alternative document clustering model
- Components of latent Dirichlet allocation model
- Goal of LDA inference
- The need for Bayesian inference
- Gibbs sampling from 10,000 feet
- A standard Gibbs sampler for LDA
- What is collapsed Gibbs sampling?
- A worked example for LDA: Initial setup
- A worked example for LDA: Deriving the resampling distribution
- Using the output of collapsed Gibbs sampling
- A brief recap
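A collapsed Gibbs sampler for LDA maintains three count tables and resamples one token's topic at a time from its conditional given all other assignments. The sketch below follows the standard collapsed conditional; all names and the hyperparameter defaults are illustrative:

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab, iters=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    V = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}
    # Count tables: doc-topic, topic-word, and per-topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [[0] * V for _ in range(num_topics)]
    nk = [0] * num_topics
    z = []
    for d, doc in enumerate(docs):          # random initial assignments
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][word_id[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], word_id[w]
                # Remove this token's assignment from the counts...
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # ...then resample from the collapsed conditional:
                # (doc-topic + alpha) * (topic-word + beta) / (topic total + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return z, ndk, nkw
```

The returned `ndk` and `nkw` tables are what the "Using the output of collapsed Gibbs sampling" topic turns into document-topic proportions and topic-word distributions.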
6. Hierarchical Clustering & Closing Remarks
- Module 1 recap
- Module 2 recap
- Module 3 recap
- Module 4 recap
- Why hierarchical clustering?
- Divisive clustering
- Agglomerative clustering
- The dendrogram
- Agglomerative clustering details
- Hidden Markov models
- What we didn't cover
- Thank you!
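For readers who want to see the final module's hierarchical clustering ideas in code, here is a minimal agglomerative sketch using single linkage (cluster-to-cluster distance = distance between their closest points); it is illustrative, not course material:

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering: start with singleton clusters and
    repeatedly merge the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: squared distance between the closest pair of points.
        return min(sum((x - y) ** 2 for x, y in zip(p, q))
                   for p in a for q in b)

    while len(clusters) > num_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording the sequence of merges (and the distance at each merge) instead of stopping at a fixed count is what produces the dendrogram discussed above.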