In the classic version of the pendulum problem , the agent is given a reward based on (1) the angle of the pendulum, (2) the angular velocity of the pendulum, and (3) the force applied. Agents get increased reward for keeping the pendulum (1) upright, (2) still, and (3) using little force.
In this alternative version, the agent's observed reward is sampled from a Gaussian with mean set to the true reward and standard deviation 3.
Comparing results of this task and the classic pendulum task allow us to measure the how resilient each agent is to reward-distortions of this type.
SemisuperPendulumNoise-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
While classic reinforcement learning problems often include stochastic reward functions, in this setting there is a true (possibly deterministic) reward function, but the signal observed by the agent is noisy. The goal of the agent is to maximize the true reward function given just the noisy signal.
Prior work has explored learning algorithms for human training scenarios of this flavor [Lopes11].
Additionally, Baird and others have noted the relationship between update noise, timestep size, and convergence rate for Q-learners [Baird94].
Robustness to noisy rewards may aid scalable oversight in settings where evaluating the true reward signal is expensive or impossible but a noisy approximation is available [Amodei16], [Christiano15].