The goal of the game is to effectively use the reward provided in order to understand the best action to take.
After each step the agent receives an observation of:
0 - No guess yet submitted (only after reset)
1 - Guess is lower than the target
2 - Guess is equal to the target
3 - Guess is higher than the target
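The reward is calculated as:
((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2
This is essentially the squared percentage of the way the agent has guessed toward the target: the ratio of the smaller to the larger of the guess and the target, each shifted by the bounds, so it reaches its maximum of 1 when the guess equals the target.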
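As a rough illustration, the reward formula can be written as a standalone function. This is a minimal sketch, not the environment's own code: the function name hotter_colder_reward is hypothetical, and it assumes that bounds is large enough for both shifted values to be positive.

```python
def hotter_colder_reward(action: float, number: float, bounds: float) -> float:
    """Squared ratio of the smaller shifted value to the larger one.

    Hypothetical helper mirroring the formula in the text. Assumes
    action + bounds > 0 and number + bounds > 0, so there is no
    division by zero and the result lies in (0, 1].
    """
    low = min(action, number) + bounds
    high = max(action, number) + bounds
    return (low / high) ** 2
```

The reward is symmetric in the guess and the target, so on its own it does not say which side of the target the guess is on; that information comes from the observation.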
Ideally an agent will be able to recognise the 'scent' of a higher reward and increase the rate at which it guesses in that direction until the reward reaches its maximum.
It is possible to reach the maximum reward within 2 steps if an agent is capable of learning the reward dynamics: the first step determines the direction of the target, and the second jumps directly to the target based on the reward.
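The two-step idea can be sketched by inverting the reward formula by hand. This is an illustration under stated assumptions rather than a learned policy: it takes the direction from the observation codes (1 = lower, 2 = equal, 3 = higher), assumes bounds is known to the solver, and uses hypothetical helper names (reward, observe, infer_target, solve_in_two_steps) that are not part of the environment.

```python
import math


def reward(action: float, number: float, bounds: float) -> float:
    # Same formula as in the text, written as a free function.
    low = min(action, number) + bounds
    high = max(action, number) + bounds
    return (low / high) ** 2


def observe(action: float, number: float) -> int:
    # Observation codes from the text: 1 = lower, 2 = equal, 3 = higher.
    if action < number:
        return 1
    if action == number:
        return 2
    return 3


def infer_target(guess: float, rew: float, obs: int, bounds: float) -> float:
    """Invert the reward to recover the target from a single guess."""
    if obs == 2:
        return guess
    ratio = math.sqrt(rew)  # (smaller + bounds) / (larger + bounds)
    if obs == 1:
        # Guess is the smaller value: ratio = (guess + b) / (target + b).
        return (guess + bounds) / ratio - bounds
    # Guess is the larger value: ratio = (target + b) / (guess + b).
    return (guess + bounds) * ratio - bounds


def solve_in_two_steps(target: float, bounds: float, first_guess: float = 0.0):
    # Step 1: probe once, reading the direction and the reward.
    obs = observe(first_guess, target)
    rew = reward(first_guess, target, bounds)
    # Step 2: jump to the value implied by the reward.
    second_guess = infer_target(first_guess, rew, obs, bounds)
    return second_guess, reward(second_guess, target, bounds)
```

Calling solve_in_two_steps(750.0, 2000.0) should return a second guess that matches the target up to floating-point error, with a final reward that is correspondingly close to 1.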