Description
With the growing amount of publicly available data and the increased focus on unstructured text, understanding how to clean, process, and analyze text data is tremendously valuable.
If you have some experience with Python and an interest in natural language processing (NLP), this course can give you the knowledge you need to tackle complex problems using machine learning.
Instructor Derek Jedamski provides a quick summary of basic NLP concepts, covers advanced data cleaning and vectorization techniques, and then takes a deep dive into building machine learning classifiers.
In this course you will learn:
- how to build two different types of machine learning models (random forest and gradient boosting), as well as how to evaluate and test variations of those models.
Syllabus:
- Welcome
- What you should know
- What tools do you need?
- Using the exercise files
1. NLP Basics
- What are NLP and NLTK?
- NLTK setup and overview
- Reading in text data
- Exploring the dataset
- What are regular expressions?
- Learning how to use regular expressions
- Regular expression replacements
- Machine learning pipeline
- Implementation: Removing punctuation
- Implementation: Tokenization
- Implementation: Removing stop words
- Chapter Quiz
2. Supplemental Data Cleaning
- Introducing stemming
- Using stemming
- Introducing lemmatizing
- Using lemmatizing
- Chapter Quiz
3. Vectorizing Raw Data
- Introducing vectorizing
- Count vectorization
- N-gram vectorizing
- Inverse document frequency weighting
- Chapter Quiz
4. Feature Engineering
- Introducing feature engineering
- Feature creation
- Feature evaluation
- Identifying features for transformation
- Box-Cox power transformation
- Chapter Quiz
5. Building Machine Learning Classifiers
- What is machine learning?
- Cross-validation and evaluation metrics
- Introducing random forest
- Building a random forest model
- Random forest with holdout test set
- Random forest model with grid search
- Evaluate random forest model performance
- Introducing gradient boosting
- Gradient-boosting grid search
- Evaluate gradient-boosting model performance
- Model selection: Data prep
- Model selection: Results
- Chapter Quiz