Description
Hello and welcome to the Reinforcement Learning class.
Here you will learn about:
- foundations of RL methods: value/policy iteration, Q-learning, policy gradient, etc.
--- with math & batteries included
- using deep neural networks for RL tasks
--- also known as "the hype train"
- state-of-the-art RL algorithms
--- and how to apply duct tape to them for practical problems.
- and, of course, teaching your neural network to play games
--- because that's what everyone thinks RL is about. We'll also use it for seq2seq and contextual bandits.
Syllabus:
1. Intro: why should I care?
- About the University
- Why should you care
- Reinforcement learning vs all
- Multi-armed bandit
- Decision process & applications
- Markov Decision Process
- Crossentropy method
- Approximate crossentropy method
- More on approximate crossentropy method
- Evolution strategies: core idea
- Evolution strategies: math problems
- Evolution strategies: log-derivative trick
- Evolution strategies: duct tape
- Blackbox optimization: drawbacks
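To give a flavor of the crossentropy method from this module, here is a minimal tabular sketch. It assumes the older gym API (reset() returns just the observation, step() returns a 4-tuple); the environment name and all hyperparameters are illustrative, not part of the course materials.

```python
import numpy as np
import gym

env = gym.make("Taxi-v3")                      # any discrete-state env works
n_states, n_actions = env.observation_space.n, env.action_space.n
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform

def play_session(t_max=200):
    """Roll out one episode under the current stochastic policy."""
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        a = np.random.choice(n_actions, p=policy[s])
        s2, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = s2
        if done:
            break
    return states, actions, total_reward

for epoch in range(50):
    sessions = [play_session() for _ in range(200)]
    rewards = [r for _, _, r in sessions]
    threshold = np.percentile(rewards, 70)      # keep the elite ~30%
    counts = np.zeros_like(policy)
    for ss, aa, r in sessions:
        if r >= threshold:
            for s, a in zip(ss, aa):
                counts[s, a] += 1
    # re-estimate the policy from elite (state, action) pairs;
    # unvisited states fall back to uniform
    visited = counts.sum(-1, keepdims=True) > 0
    new_policy = np.where(
        visited, counts / np.maximum(counts.sum(-1, keepdims=True), 1),
        1.0 / n_actions)
    policy = 0.5 * policy + 0.5 * new_policy    # smoothing (a bit of duct tape)
```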
2. At the heart of RL: Dynamic Programming
- Reward design
- State and Action Value Functions
- Measuring Policy Optimality
- Policy: evaluation & improvement
- Policy and value iteration
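As a companion to this module, a minimal value-iteration sketch. The transition-table format is an assumption for illustration: P[s][a] is a list of (probability, next_state, reward) triples, a simplified version of the env.P convention in gym's toy-text environments.

```python
def value_iteration(P, gamma=0.99, theta=1e-6):
    """Iterate the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Q(s, a) = sum over outcomes of p * (r + gamma * V(s'))
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # converged to within theta everywhere
            break
    # the greedy policy w.r.t. the converged V is optimal
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy
```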
3. Model-free methods
- Model-based vs model-free
- Monte-Carlo & Temporal Difference; Q-learning
- Exploration vs Exploitation
- Footnote: Monte-Carlo vs Temporal Difference
- Accounting for exploration. Expected Value SARSA
- On-policy vs off-policy; Experience replay
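To make this module concrete, a minimal tabular Q-learning loop with epsilon-greedy exploration. As above, the environment name, the older gym API, and all hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict
import gym

env = gym.make("Taxi-v3")
n_actions = env.action_space.n
Q = defaultdict(float)             # Q[(state, action)], zero by default
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, done = env.reset(), False
    while not done:
        # explore with probability eps, otherwise act greedily
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
        s2, r, done, _ = env.step(a)
        # the max over next actions makes this off-policy (Q-learning);
        # using the eps-greedy expectation instead gives Expected Value SARSA
        target = r + gamma * max(Q[(s2, a_)] for a_ in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
```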
4. Approximate Value Based Methods
- Supervised & Reinforcement Learning
- Loss functions in value based RL
- Difficulties with Approximate Methods
- DQN – bird's eye view
- DQN – the internals
- DQN: statistical issues
- Double Q-learning
- More DQN tricks
- Partial observability
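A sketch of the one-step DQN loss from this module, written in PyTorch. The frozen target network and the Huber loss are two of the "tricks" listed above; all tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: states (B, ...), actions (B,) long, rewards (B,),
    # next_states (B, ...), dones (B,) float in {0, 1}
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the replay buffer
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                      # no gradient through targets
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    return F.smooth_l1_loss(q_taken, targets)  # Huber: robust to outliers
```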
5. Policy-based methods
- Intuition
- All Kinds of Policies
- Policy gradient formalism
- The log-derivative trick
- REINFORCE
- Advantage actor-critic
- Duct tape zone
- Policy-based vs Value-based
- Case study: A3C
- Combining supervised & reinforcement learning
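For this module, a sketch of the REINFORCE objective obtained from the log-derivative trick: grad J = E[grad log pi(a|s) * G]. Here `policy_net` is a hypothetical module mapping states to action logits, and `returns` are the discounted returns per step.

```python
import torch

def reinforce_loss(policy_net, states, actions, returns):
    """Minimizing this loss ascends J via grad log pi(a|s) * G."""
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # subtracting a baseline from `returns` (an advantage estimate) cuts
    # variance, which is exactly the step from REINFORCE to actor-critic
    return -(chosen * returns).mean()
```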
6. Exploration
- Recap: bandits
- Regret: measuring the quality of exploration
- The message just repeats. 'Regret, Regret, Regret.'
- Intuitive explanation
- Thompson Sampling
- Optimism in face of uncertainty
- UCB-1
- Bayesian UCB
- Introduction to planning
- Monte Carlo Tree Search
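And to close out the exploration module, a minimal UCB-1 bandit sketch: pick the arm with the highest optimistic estimate, mean + sqrt(2 ln t / n_pulls). The Bernoulli arm means are made up for illustration.

```python
import math
import random

true_means = [0.2, 0.5, 0.7]       # hypothetical Bernoulli arms
counts = [0] * len(true_means)
sums = [0.0] * len(true_means)

for t in range(1, 10001):
    if 0 in counts:                # pull every arm once before using UCB
        arm = counts.index(0)
    else:
        arm = max(range(len(true_means)),
                  key=lambda a: sums[a] / counts[a]
                              + math.sqrt(2 * math.log(t) / counts[a]))
    reward = float(random.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts)    # regret grows only logarithmically,
                                   # so the best arm should dominate
```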