Case Studies: Finding Documents That Are Similar
You want to recommend similar articles to a reader who is interested in a specific news article. What is the correct definition of similarity? What if there are millions of additional documents? Do you have to search through all other documents every time you want to retrieve a new document? How do you group documents that are similar? How do you find out about new and emerging topics covered in the documents?
In this third case study, finding similar documents, you will investigate similarity-based retrieval algorithms. The course also covers structured representations for describing documents in a corpus, including clustering and mixed membership models such as latent Dirichlet allocation (LDA). You will use expectation maximisation (EM) to learn document clusterings and scale the methods using MapReduce.
Learning Objectives: By the end of this course, you should be able to:
- Create a document retrieval system based on k-nearest neighbours.
- Compare various similarity metrics for text data.
- Use KD-trees to reduce computations in k-nearest neighbour search.
- Use locality sensitive hashing to compute approximate nearest neighbours.
- Consider the differences between supervised and unsupervised learning tasks.
- Use k-means to group documents by topic.
- Explain how to use MapReduce to parallelize k-means.
- Consider probabilistic clustering approaches that make use of mixture models.
- Use expectation maximisation (EM) to fit a Gaussian mixture model.
- Use latent Dirichlet allocation (LDA) to perform mixed membership modelling.
- Explain the steps of a Gibbs sampler and how to use the output to draw conclusions.
- Compare and contrast non-convex optimization initialization techniques.
- Implement these techniques in Python.
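As a taste of the first objective, a k-nearest-neighbour document retrieval system over raw term counts might look like the following minimal sketch. The helper names (`tf_vector`, `cosine_distance`, `knn_retrieve`) are illustrative, not course code:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector as a sparse word -> count mapping."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 minus cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def knn_retrieve(query, corpus, k=2):
    """Brute force: rank every document by distance to the query,
    return the indices of the k nearest."""
    q = tf_vector(query)
    dists = [(cosine_distance(q, tf_vector(d)), i) for i, d in enumerate(corpus)]
    dists.sort()
    return [i for _, i in dists[:k]]
```

This brute-force search scans every document per query; the KD-tree and LSH material in Module 2 exists precisely to avoid that linear scan.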
1. Welcome
- Introduction to clustering and retrieval tasks
- Course overview
- Module-by-module topics covered
- Assumed background
2. Nearest Neighbor Search
- Retrieval as k-nearest neighbor search
- 1-NN algorithm
- k-NN algorithm
- Document representation
- Distance metrics: Euclidean and scaled Euclidean
- Writing (scaled) Euclidean distance using (weighted) inner products
- Distance metrics: Cosine similarity
- To normalize or not and other distance considerations
- Complexity of brute force search
- KD-tree representation
- NN search with KD-trees
- Complexity of NN search with KD-trees
- Visualizing scaling behavior of KD-trees
- Approximate k-NN search using KD-trees
- Limitations of KD-trees
- LSH as an alternative to KD-trees
- Using random lines to partition points
- Defining more bins
- Searching neighboring bins
- LSH in higher dimensions
- (OPTIONAL) Improving efficiency through multiple tables
- A brief recap
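The random-line partitioning behind LSH can be sketched as follows: each random line (a random hyperplane in higher dimensions) contributes one bit to a bin key, and points that land in the same bin are candidate neighbours. `lsh_bins` and `query_bin` are hypothetical helper names, and a real system would also search neighbouring bins as the module describes:

```python
import random
from collections import defaultdict

def lsh_bins(vectors, num_lines, seed=0):
    """Hash each vector to a bit-string bin: one bit per random line,
    set by which side of the line the vector falls on."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    lines = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_lines)]
    bins = defaultdict(list)
    for idx, v in enumerate(vectors):
        key = tuple(sum(l[j] * v[j] for j in range(dim)) >= 0 for l in lines)
        bins[key].append(idx)
    return bins, lines

def query_bin(q, lines):
    """Compute the bin key a query point hashes to."""
    return tuple(sum(l[j] * q[j] for j in range(len(q))) >= 0 for l in lines)
```

More lines mean more, smaller bins (cheaper search within a bin, higher risk of splitting true neighbours apart), which is the trade-off the "Defining more bins" and "Searching neighboring bins" topics explore.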
3. Clustering with k-means
- The goal of clustering
- An unsupervised task
- Hope for unsupervised learning, and some challenge cases
- The k-means algorithm
- k-means as coordinate descent
- Smart initialization via k-means++
- Assessing the quality and choosing the number of clusters
- Motivating MapReduce
- The general MapReduce abstraction
- MapReduce execution overview and combiners
- MapReduce for k-means
- Other applications of clustering
- A brief recap
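The k-means coordinate-descent loop from this module reduces to two alternating steps; here is a minimal plain-Python sketch with random initialization rather than k-means++ (function and variable names are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters
```

The assignment step is embarrassingly parallel over points and the update step is a per-cluster sum, which is why the algorithm maps so naturally onto MapReduce.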
4. Mixture Models
- Motivating probabilistic clustering models
- Aggregating over unknown classes in an image dataset
- Univariate Gaussian distributions
- Bivariate and multivariate Gaussians
- Mixture of Gaussians
- Interpreting the mixture of Gaussian terms
- Scaling mixtures of Gaussians for document clustering
- Computing soft assignments from known cluster parameters
- (OPTIONAL) Responsibilities as Bayes' rule
- Estimating cluster parameters from known cluster assignments
- Estimating cluster parameters from soft assignments
- EM iterates in equations and pictures
- Convergence, initialization, and overfitting of EM
- Relationship to k-means
- A brief recap
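The E- and M-steps above can be illustrated on a one-dimensional, two-component Gaussian mixture. This is a toy sketch with ad-hoc initialization (means at the data extremes), not the course's implementation:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = [min(data), max(data)]   # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft assignments (responsibilities) via Bayes' rule.
        resp = []
        for x in data:
            lik = [pi[k] / math.sqrt(2 * math.pi * var[k])
                   * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(lik)
            resp.append([l / s for l in lik])
        # M-step: re-estimate weights, means, variances from soft counts.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return pi, mu, var
```

If the responsibilities are hardened to 0/1 and the variances are fixed and shared, the loop collapses to k-means, which is the "Relationship to k-means" point above.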
5. Mixed Membership Modeling via Latent Dirichlet Allocation
- Mixed membership models for documents
- An alternative document clustering model
- Components of latent Dirichlet allocation model
- Goal of LDA inference
- The need for Bayesian inference
- Gibbs sampling from 10,000 feet
- A standard Gibbs sampler for LDA
- What is collapsed Gibbs sampling?
- A worked example for LDA: Initial setup
- A worked example for LDA: Deriving the resampling distribution
- Using the output of collapsed Gibbs sampling
- A brief recap
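A collapsed Gibbs sampler for LDA maintains three count tables and resamples one token's topic at a time from its conditional given all other assignments. The sketch below follows the standard collapsed conditional; all names and the hyperparameter defaults are illustrative:

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab, iters=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    V = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}
    # Count tables: doc-topic, topic-word, and per-topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [[0] * V for _ in range(num_topics)]
    nk = [0] * num_topics
    z = []
    for d, doc in enumerate(docs):          # random initial assignments
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][word_id[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], word_id[w]
                # Remove this token's assignment from the counts...
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # ...then resample from the collapsed conditional:
                # (doc-topic + alpha) * (topic-word + beta) / (topic total + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return z, ndk, nkw
```

The returned `ndk` and `nkw` tables are what the "Using the output of collapsed Gibbs sampling" topic turns into document-topic proportions and topic-word distributions.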
6. Hierarchical Clustering & Closing Remarks
- Module 1 recap
- Module 2 recap
- Module 3 recap
- Module 4 recap
- Why hierarchical clustering?
- Divisive clustering
- Agglomerative clustering
- The dendrogram
- Agglomerative clustering details
- Hidden Markov models
- What we didn't cover
- Thank you!
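For readers who want to see the final module's hierarchical clustering ideas in code, here is a minimal agglomerative sketch using single linkage (cluster-to-cluster distance = distance between their closest points); it is illustrative, not course material:

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering: start with singleton clusters and
    repeatedly merge the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: squared distance between the closest pair of points.
        return min(sum((x - y) ** 2 for x, y in zip(p, q))
                   for p in a for q in b)

    while len(clusters) > num_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording the sequence of merges (and the distance at each merge) instead of stopping at a fixed count is what produces the dendrogram discussed above.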