In this game the agent moves along a linear chain of states, with two actions:
- forward, which moves along the chain but returns no reward
- backward, which returns to the beginning and has a small reward
The end of the chain, however, offers a large reward, and taking 'forward' again at the end of the chain repeats this large reward.
On each action, there is a small probability that the agent 'slips' and the transition of the opposite action is taken instead.
The observed state is the current state in the chain (0 to n-1).
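The dynamics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the class name `ChainSketch`, the action encoding (0 = forward, 1 = backward), and the default values for `n`, `slip`, `small`, and `large` are all hypothetical choices made for the example. It also assumes the large reward is paid only when 'forward' is taken while the agent is already in the last state, and that the agent then stays there.

```python
import random


class ChainSketch:
    """Illustrative chain environment; names and defaults are assumptions."""

    FORWARD, BACKWARD = 0, 1  # assumed action encoding

    def __init__(self, n=5, slip=0.2, small=2, large=10, seed=None):
        self.n = n          # number of states in the chain (0 to n-1)
        self.slip = slip    # probability that the opposite transition is taken
        self.small = small  # reward for moving backward
        self.large = large  # reward for moving forward at the end of the chain
        self.state = 0
        self.rng = random.Random(seed)

    def reset(self):
        """Return the agent to the beginning of the chain."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply one action and return (observed state, reward)."""
        # The agent 'slips' with probability `slip`: flip the action.
        if self.rng.random() < self.slip:
            action = 1 - action
        if action == self.BACKWARD:
            # Backward: return to the beginning for a small reward.
            self.state = 0
            reward = self.small
        elif self.state < self.n - 1:
            # Forward before the end: advance one state, no reward.
            self.state += 1
            reward = 0
        else:
            # Forward at the end: stay in place and collect the large reward.
            reward = self.large
        return self.state, reward
```

With `slip=0.0` the sketch is deterministic, which makes the trade-off easy to see: a 'backward' action pays a small reward immediately, while the large reward only arrives after several unrewarded 'forward' steps.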
This environment is described in section 6.1 of "A Bayesian Framework for Reinforcement Learning" by Malcolm Strens (2000): http://ceit.aut.ac.ir/~shiry/lecture/machine-learning/papers/BRL-2000.pdf