Approaches for Sample-efficient Reinforcement Learning

Type

dissertation

Grantor

University of Wisconsin-Milwaukee

Abstract

Reinforcement Learning (RL) is an area of machine learning concerned with how an agent should take actions to maximize a reward. In the RL paradigm, an agent in a state takes an action, which takes it to a new state and yields a reward. This process is repeated, and the goal is to maximize the discounted cumulative reward. The agent's actions are given by a policy, so the RL problem becomes finding the policy parameters that maximize the discounted cumulative reward. During learning, the agent interacts with the environment, and those interactions are used to update the policy parameters. One challenge that has prevented RL from being widely used in real-world applications is its low sample efficiency: too many environment interactions are needed before the agent produces an acceptable, or even safe, behavior. This is especially an issue in applications where environment interactions and potentially bad outcomes are prohibitively costly, e.g., a self-driving car crashing or a warehouse robot causing an accident. In this thesis, we present approaches aimed at improving sample efficiency. We identify a trade-off between learning and adaptability, whereby an agent that focuses too heavily on learning from current and past interactions loses its ability to adapt to future situations, which ends up hindering long-term learning. This trade-off stems from value function estimation, which tends to bias the model parameters towards low-rank solutions; once a low-rank solution is reached, the model's ability to adapt to new scenarios is greatly decreased. With this insight in mind, we reformulate the RL problem as a constrained optimization problem, where the aim is still to maximize the discounted cumulative reward, but with a lower bound on the rank of the solution. We implement these ideas in an RL algorithm that we coin SIRL (Singular values for RL).
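The two quantities the paragraph above rests on can be sketched in a few lines: the discounted cumulative reward the agent maximizes, and a singular-value count as a proxy for the rank of a solution. This is a minimal illustration, not the SIRL algorithm itself; the `tol` threshold and the use of a feature matrix are assumptions for the example.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # Discounted cumulative reward: sum over t of gamma^t * r_t.
    return sum(gamma ** t + 0.0 if False else gamma ** t * r
               for t, r in enumerate(rewards))

def effective_rank(features, tol=0.01):
    # Proxy for the rank of a (hypothetical) feature matrix:
    # count singular values above a fraction of the largest one.
    s = np.linalg.svd(features, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```

For instance, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` gives 1 + 0.5 + 0.25 = 1.75, and a matrix whose rows are all identical has effective rank 1 regardless of its size, which is the kind of collapse the rank lower bound is meant to prevent.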
We empirically evaluate SIRL against other competitive benchmarks, showing that it achieves better sample efficiency. Another approach to improving sample efficiency is to leverage existing datasets so that the agent does not have to learn from scratch. A promising direction is to use existing datasets to learn skills, where a skill is a sequence of actions (e.g., a skill could be open the door, representing the sequence of actions: reach handle, twist it, pull, release handle). In this case, the policy is over skills rather than individual actions. We build upon these ideas and generalize to the case where skills of multiple lengths are used (e.g., the open-door skill has a length of four, as it is composed of four actions). We propose a practical RL algorithm that we coin MLSP (Multi-length Skills Priors for RL), and we empirically show that allowing the model to use multi-length skills improves sample efficiency.
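The skill-based setup described above can be sketched as follows. The skill names, the action names, and the rollout helper are all hypothetical illustrations of a policy that acts over variable-length skills, not the MLSP implementation.

```python
# Hypothetical skill library: each skill is a named sequence of
# primitive actions, and different skills may have different lengths.
SKILLS = {
    "open_door": ["reach_handle", "twist", "pull", "release_handle"],
    "push_button": ["reach_button", "press"],
}

def rollout_with_skills(policy, env_step, horizon=10):
    # The policy picks a skill; the agent then executes every
    # primitive action in that skill before picking again.
    actions_taken = []
    while len(actions_taken) < horizon:
        skill = policy(SKILLS)
        for action in SKILLS[skill]:
            env_step(action)
            actions_taken.append(action)
            if len(actions_taken) >= horizon:
                break
    return actions_taken
```

A policy that always selects `"push_button"` with a horizon of 4 would execute `reach_button, press, reach_button, press`: the decision is made once per skill, not once per primitive action, which is what makes the policy over skills cheaper to learn.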
