My first Reinforcement Learning code

This tutorial teaches you some essentials of Python and how they can be applied to solve a reinforcement learning problem. The Python essentials that will be covered are:

  • statements
  • variables
  • lists
  • classes
  • function calls

The problem that we will try to solve is the gambler's problem (Sutton & Barto 1998 A gambler has X dollars and he can bet on a coin turning heads to obtain 100 dollars. He can stake any amount of money he has, and he can keep on betting until he has no money left. When the coin turns heads he earns as much money as he staked. When the coin turns tails he loses his stake. The coin is bent to the gamblers disadvantage. For instance, the probabily of turning heads might be 0.2, so that the coin will, on average, turn heads only on two out of ten throws. If the gambler does not know whether the coin is bent or not - he loses. Can the gambler lose less if we tell him with which probability the coin is turning heads?


The code consists of statements that perform some operations on inputs and produce outputs. The simplest form of statement is an assignment of a value to a variable.

In [ ]:
p = 0.4

This statement means that we are assigning the value of 0.4 to the variable p. Next time we use the variable p in some other statement it will be carrying that value.

In [ ]:
kind = 'random'
In [ ]:
This statement assigns a string to a variable named kind.  

A variable can also be a list:

In [ ]:
Prob = [ 1 - p, p ]

Variables can be more complex objects, but we will not talk about them just yet. First let's introduce other kinds of statements. An "if statement" consists of a logical expression that can be either true or false. In case it is true, one set of statements is executed. In case it is false, another set of statements is executed.

In [ ]:
if p < 0 :
    kind = 'random'
else :
    kind = 'informed'

Another important kind of statements are loops. In case we want to execute several statements that are similar, instead of typing them one after another we can apply loops. Here is an example of a "for loop":

In [ ]:
for i in [0, 1]:
    print "Probability of coin turning", i, "is", Prob[i], "\n"

Note that lists are indexed from 0 to n-1 where n is the length of the list.

Here is an example of a "while loop":

In [ ]:
i = 0
while i < 2:
    print "Probability of coin turning", i, "is", Prob[i], "\n"
    i += 1

In the last line we use:

In [ ]:
i += 1

This is equivalent to:

In [ ]:
i = i + 1

And it means that we are taking the value of i, adding 1 to it, and that the resulting value becomes the new value of i.

A function is a part of the code which processes some inputs and returns outputs. This part of code can be called anywhere with the function name and inputs. The advantage of using a function is that we do not have to duplicate a piece of code. Also, the code is more readable if it is divided into a number of functions.

In [ ]:
def sum(stakes):
    total = 0
    for elem in stakes:
        total += elem
    return total

print sum([10, 20])
print sum([20, 40, 80])

There are a number of built-in functions that one can use. In fact sum() is one of them. Also, a large number of functions can be accessed via libraries.

Every list has its own function which appends an element to itself.

In [ ]:
stake = [] #empty list
stake.append(10) #adding 10 to the list

Another useful built-in function is range(n). For a given integer it returns all values from zero up to n - 1.

In [1]:
print range(5)
[0, 1, 2, 3, 4]

We have already seen that lists can have their own functions. Objects that can have their own variables and functions are defined as classes. Their variables are defined as self.variable and their functions are called as self.function_name(). A class consists of number of functions.

In [1]:
class Gambler:
    def __init__(self, p=-1):
        Gambler constructor.

        p:      probability of scoring heads (default: -1 for no knowledge)
        kind:   gambler type [random|informed] 
        if p < 0:   # the gambler has no knowledge of the coin
            self.kind = 'random'
        else:       # the gambler has the knowledge of the coin
            self.kind = 'informed'

            #probability of coin turning 0 and 1
            self.Prob = [1 - p, p]

            #perform strategy learning

    def update_allowed_stakes(self, capital):
        The allowed stakes for a given capital
        stakes = []

        # the gambler can only bet if he/she has between 1 and 99 dollars
        if capital > 0 and capital < 100:
            #the gambler's goal is to reach 100 dollars so only betting upto maximum
            for a in range(1, min(capital, 100 - capital) + 1):
        return stakes

The function init() is called when we create a variable of the class Gambler. This function is called the constructor. p=-1 is a default value if no p is provided.

In [5]:
random_gambler = Gambler()
print random_gambler.kind

Given a certain capital, the function update_allowed_stakes computes all possible stakes. We use a range function that has the form range(a,b+1) to generate a list of values [a, a+1, ..., b]. This is to set the minimal stake to 1 dollar (a = 1). The maximum stake is either the whole capital or how much is needed to reach 100 dollars, whichever is smaller ( min(capital, 100 - capital) ).

Let's create two gamblers - an uninformed gambler and an informed one. Later, we will compare their performances.

In [ ]:
gamblers = [Gambler(p), Gambler()]

Exercise 1:

Write a class Coin which has a variable p that represents the probability of turning heads and is passed through a constructor. Set p to a default value of 0.5 in case it is not provided in the input. The class should also have a function def throw(self) which with probability p produces heads (1) and otherwise produces tails (0). Use the function random() from the library random to get a random number between 0 and 1. You can find the description of the function here Add your code to

By editing the code in, test how much a gambler that is placing random bets loses by calling the function test_gambler with an instance of a coin, 100 attempts. The default starting capital at each attempt is set to 75 and the default heads probability is set to 0.4. Print how much the gambler has lost.

Exercise 2a:

In the second part of the exercise an informed gambler will try to figure out for himself which actions to take so that he loses less. His goal is to obtain 100 dollars. We will define this goal in terms of a numerical reward. The gambler will award himself 1 point when he gets 100 dollars.

In [ ]:
    def reward(self, new_capital):
        The gambler reaches the goal when the new cpaital is 100
        if new_capital == 100:
            return 1
            return 0

Finding the best strategy will require estimating two lists iteratively. The first list is stake_values, which for every possible stake estimates the expected reward. The second list is capital_values, which for every possible initial capital estimates the expected reward for the whole game.

If he know how to estimate stake_values, the optimal strategy is to place the stake which gives the highest expected reward:

In [ ]:
    def learned_strategy(self, capital):
        Informed strategy

        # estimate the value of each action for a given state
        stake_values = self.stake_values(capital)

        #return the action that has the highest value
        return numpy.argmax(stake_values) + 1

The last line of this code requires further explanation. numpy is a useful library to perform numerical and scientific computing. Here we use a simple function called argmax. We have seen before that max() returns the highest value in a list. argmax() returns the index of the highest value. Why are we adding 1 here? Well, actions are dollars that the gambler stakes they go from 1 rather than from zero.

Let's try to write a function self.stake_values(capital) assuming that we are given self.capital_value for every capital. The key equation that we are trying to implement is:


In practice, we don't have to iterate over all possible new capitals. For every possible stake and capital there are only two possible outcomes:

- coin turning heads (with probability p), in which case new_capital is capital + stake; and
- coin turning tails (with probability 1 - p), in which case new_capital is capital - stake.

So there are only two possible new_capitals. Edit the function stake_values in by replacing TODOs with adequate statements.

In [ ]:
    def stake_values(self, capital):

        # value for each stake is initially unknown
        # TODO: initialise stake_values to empty list

        #we iterate over possible stakes
        possible_stakes = self.update_allowed_stakes(capital)

        #TODO: for every possible stake

            #TODO: initialise stake value to zero

            #TODO: for every possible coin outcome
                #TODO using the function self.update_capital() estimate new_capital

                #TODO probability of outcome x ( reward(new_capital) + self.capital_values[new_capital])
                #TODO: estimate the reward of new_capital using self.reward function 
                # add it to self.capital_value of new_capital,
                # scale the sum with probability of that outcome and add it to the value

            #TODO: add value to the list stake_values

        return stake_values

Exercise 2b:

Now that we have found the value of each stake, the value of each capital is simply the stake that has the highest value


But isn't this a chicken-and-egg problem? If we have the value for each capital, we can estimate the value of the stake and vice versa. But how do we start??

Simple! Assume that all capital values are initially zero and iteratively make updates to the stake values and the capital values until the capital values stop changing. This algorithm is called Value Iteration. Replace TODO with an adeqate statement in the function value_iteration in

In [ ]:
    def value_iteration(self):
        Core update algorithm

        # initial capital values
        self.capital_values = numpy.zeros(101)

        #initial possible starting capital
        capitals = range(101)

        while True:

            Delta = 0

            for i in range(101):
                # keep the record of old value for each possible starting capital
                v = self.capital_values[i]

                #keep initial and final dummy state for zero capital and 100 capital - do not update the value
                if i == 0 or i == 100:

                # what is the value of each allowed stake for this capital
                stake_values = self.stake_values(capitals[i])

                #TODO: value of the capital is the value of the stake which has the highest value

                #keep track of how much the capital values have changed
                Delta = max(Delta, math.fabs(v - self.capital_values[i]))

            #if the values have not changed a lot then break
            if Delta < math.exp(-20):

Now check which gambler loses less by testing both gamblers in