TF-Agents Tutorial

Reinforcement Learning with TF-Agents

Reinforcement learning (RL) is a general framework where agents learn to perform actions in an environment so as to maximize a reward. The two main components are the environment, which represents the problems to be solved, and the agent, which represents the learning algorithm.

The agent and environment continuously interact with each other. At each time step, the agent takes an action on the environment based on its policy , where is the current observation from the environment, and receives a reward and the next observation from the environment. The goal is to improve the policy so as to maximize the sum of rewards.

Introduction

TF-Agents makes designing, implementing and testing new RL algorithms easier, by providing well tested modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking.

Cartpole Environment

The Cartpole environment is one of the most well known classic RL.

  • The observation from the environment st is a 4D vector representing the position and velocity of the cart, and the angle and angular velocity of the pole.
  • The agent can control the system by taking one of 2 actions at: push the cart right (+1) or left (-1).
  • A reward rt+1=1 is provided for every timestep that the pole remains upright. The episode ends when one of the following is true:
    • the pole tips over some angle limit
    • the cart moves outside of the world edges
    • 200 time steps pass.

DQN Agent

The DQN (Deep Q-Network) algorithm was developed by DeepMind in 2015. It was able to solve a wide range of Atari games (some to superhuman level) by combining reinforcement learning and deep neural networks at scale. The algorithm was developed by enhancing a classic RL algorithm called Q-Learning with deep neural networks and a technique called experience replay.

Setup

Dependencies

Hyperparameters

In the Cartpole environment:

  • observation is an array of 4 floats:
    • the position and velocity of the cart
    • the angular position and velocity of the pole
  • reward is a scalar float value
  • action is a scalar integer with only two possible values:
    • 0 β€” "move left"
    • 1 β€” "move right"

Training the Agent

It will take ~7 minutes to run

22