PredictObsCartpole-v0 (experimental)
Like the classic cartpole task [1], but the agent receives extra reward for correctly predicting its next 5 observations: each correct prediction earns a 0.1 bonus.
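Concretely, an interaction loop might look like the following sketch. The tuple action format (base cartpole action plus a list of five predicted observations) and the trivial constant-prediction baseline are illustrative assumptions, not a documented API.

```python
import gym
import numpy as np

env = gym.make("PredictObsCartpole-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    # Naive base policy: push the cart toward the side the pole leans.
    cart_action = 0 if obs[2] < 0 else 1
    # Naive predictor (assumption): guess that the next 5 observations
    # all equal the current one. A real agent would learn these.
    predictions = [np.array(obs, copy=True) for _ in range(5)]
    # Assumed action format: (cartpole action, list of 5 predicted obs).
    obs, reward, done, info = env.step((cart_action, predictions))
    total_reward += reward
print("episode return:", total_reward)
```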
Intuitively, a learner that does well on this problem will be able to explain its decisions by projecting the observations that it expects to see as a result of its actions.
This is a toy problem, but the principle is useful: imagine a household robot or a self-driving car that accurately tells you what it expects to perceive after executing a given plan of action. This would inspire confidence in the human operator and may allow early intervention if the agent is heading in the wrong direction.
PredictObsCartpole-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
Note: We don't allow agents to earn the bonus reward until timestep 100 of each episode. This requires agents to actually solve the cartpole problem before working on being interpretable; we don't want weak agents focusing solely on predicting their own failures.
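To make the scoring rule concrete, here is a minimal sketch of the bookkeeping a wrapper could use to pay out the bonus. The `np.allclose` tolerance and the dictionary-based queueing are assumptions; this page only specifies the 0.1 bonus per correct prediction and the timestep-100 cutoff.

```python
import numpy as np

class PredictionBonusTracker:
    """Sketch of the prediction-bonus bookkeeping (matching rule assumed)."""

    def __init__(self, bonus=0.1, min_timestep=100, atol=1e-2):
        self.bonus = bonus                # reward per correct prediction
        self.min_timestep = min_timestep  # no bonus before this timestep
        self.atol = atol                  # assumed "correctness" tolerance
        self.pending = {}                 # target timestep -> predictions

    def record(self, t, predictions):
        """Queue the 5 observations predicted at timestep t (targets t+1..t+5)."""
        for k, pred in enumerate(predictions, start=1):
            self.pending.setdefault(t + k, []).append(np.asarray(pred))

    def score(self, t, obs):
        """Pay the bonus for each queued prediction matching the obs seen at t."""
        total = 0.0
        for pred in self.pending.pop(t, []):
            if t >= self.min_timestep and np.allclose(pred, obs, atol=self.atol):
                total += self.bonus
        return total
```

A wrapper would call `record(t, predictions)` when the agent acts at timestep `t` and add `score(t + 1, next_obs)` to the underlying cartpole reward, so up to five predictions (made one to five steps earlier) are scored against each observation.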
Prior work has studied prediction in reinforcement learning [Junhyuk15], while other work has explicitly focused on more general notions of interpretability [Maes12]. Outside of reinforcement learning, there is related work on interpretable supervised learning algorithms [Vellido12], [Wang16]. Additionally, predicting poor outcomes and summoning human intervention may be an important part of safe exploration [Amodei16] with oversight [Christiano15]. These predictions may also be useful for penalizing predicted reward hacking [Amodei16]. We hope a simple domain of this nature promotes further investigation into prediction, interpretability, and related properties.
PredictObsCartpole-v0 Evaluations
| Algorithm | Best 100-episode performance | Submitted |
|---|---|---|
| drburke's algorithm writeup | 245.40 ± 0.15 | |
| JKCooper2's algorithm | 24.98 ± 1.64 | |
| JKCooper2's algorithm | N/A | |