Reinforcement Learning - Summary and Outlook
Reinforcement Learning: Summary and Outlook This series started with a simple treasure hunt and gradually moved toward deep reinforcement learning. The goal was not to cover every algorithm, but t...
Advanced Topics: Dyna-Q and Curiosity-Driven Learning Most of the previous methods learn from real experience. The agent acts, observes the result, and updates its policy or value function. This ...
PPO: Proximal Policy Optimization Policy gradient methods update the policy directly. This is flexible, but large updates can damage a policy very quickly. PPO tries to solve this with a simple r...
A3C: Asynchronous Advantage Actor-Critic A3C is an Actor-Critic method that uses multiple workers in parallel. Each worker interacts with its own copy of the environment and sends updates to a sha...
DDPG: Deep RL for Continuous Control DQN works with discrete actions. For example, an agent may choose left or right. Many control problems need continuous actions instead. A robot joint may need...
Actor-Critic: Best of Both Worlds Actor-Critic methods combine two ideas. The actor decides what to do. The critic evaluates how good that decision was. This is useful because learning a policy d...
OpenAI Gym: Your RL Playground Reinforcement learning needs environments. An environment gives the agent observations, receives actions, and returns rewards. OpenAI Gym became popular because it ...
DQN Improvements: Double, Dueling, and Prioritized Experience DQN is a useful starting point for deep reinforcement learning, but it has some common weaknesses. Three popular improvements are Doub...
Enter the Deep: Deep Q-Networks (DQN) Q-Learning works well when the state space is small. In the maze example, we can store a value for every state-action pair in a table. But many problems are ...
SARSA Lambda: Adding Memory with Eligibility Traces Standard SARSA updates one state-action pair at a time. This is simple, but it can be slow when a reward comes much later than the useful action...