Description
In this course, you will:
- Learn the fundamental techniques for cleansing and processing text in R, and how to convert text into a format suitable for analytics and predictions.
- Review text extraction, cleansing, and processing techniques.
- Convert text into an analytics-ready format using n-grams and TF-IDF.
- Work through examples that use R and the tm library to put these techniques into practice.
Syllabus:
1. Introduction to Text Mining
- Purpose
- Document
- Corpus
- R text processing libraries
- Setting up the environment
2. Corpus in R
- PCorpus and VCorpus
- Reading files with CorpusReader
- Exploring the corpus
- Persisting the corpus
3. Text Cleansing and Extraction
- Setup for processing
- Cleansing text
- Stop word removal
- Stemming
- Managing metadata
4. TF-IDF
- Introduction to TF-IDF
- Generating a term frequency matrix
- Improving the term frequency matrix
- Plotting term frequency
- Generating TF-IDF
5. N-Grams
- N-gram concepts
- Using RWeka NGramTokenizer
- Creating an n-gram term frequency matrix
- Extracting n-gram pairs
6. Best Practices
- Storing text
- Processing text data
- Scalability