This tutorial teaches you some essentials of Python and how they can be applied to solve a reinforcement learning problem. The Python essentials that will be covered are:

- statements
- variables
- lists
- classes
- function calls

The problem that we will try to solve is the gambler's problem (Sutton & Barto 1998 https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html). A gambler has X dollars and he can bet on a coin turning heads to obtain 100 dollars. He can stake any amount of money he has, and he can keep on betting until he has no money left. When the coin turns heads he earns as much money as he staked. When the coin turns tails he loses his stake. The coin is bent to the gamblers disadvantage. For instance, the probabily of turning heads might be 0.2, so that the coin will, on average, turn heads only on two out of ten throws. If the gambler does not know whether the coin is bent or not - he loses. Can the gambler lose less if we tell him with which probability the coin is turning heads?

In [ ]:

```
p = 0.4
```

In [ ]:

```
kind = 'random'
```

In [ ]:

```
This statement assigns a string to a variable named kind.
```

A variable can also be a list:

In [ ]:

```
Prob = [ 1 - p, p ]
```

In [ ]:

```
if p < 0 :
kind = 'random'
else :
kind = 'informed'
```

In [ ]:

```
for i in [0, 1]:
print "Probability of coin turning", i, "is", Prob[i], "\n"
```

Note that lists are indexed from 0 to n-1 where n is the length of the list.

Here is an example of a "while loop":

In [ ]:

```
i = 0
while i < 2:
print "Probability of coin turning", i, "is", Prob[i], "\n"
i += 1
```

In the last line we use:

In [ ]:

```
i += 1
```

This is equivalent to:

In [ ]:

```
i = i + 1
```

In [ ]:

```
def sum(stakes):
total = 0
for elem in stakes:
total += elem
return total
print sum([10, 20])
print sum([20, 40, 80])
```

Every list has its own function which appends an element to itself.

In [ ]:

```
stake = [] #empty list
stake.append(10) #adding 10 to the list
```

In [1]:

```
print range(5)
```

In [1]:

```
class Gambler:
def __init__(self, p=-1):
"""
Gambler constructor.
Parameters
----------
p: probability of scoring heads (default: -1 for no knowledge)
kind: gambler type [random|informed]
"""
if p < 0: # the gambler has no knowledge of the coin
self.kind = 'random'
else: # the gambler has the knowledge of the coin
self.kind = 'informed'
#probability of coin turning 0 and 1
self.Prob = [1 - p, p]
#perform strategy learning
self.value_iteration()
def update_allowed_stakes(self, capital):
"""
The allowed stakes for a given capital
"""
stakes = []
# the gambler can only bet if he/she has between 1 and 99 dollars
if capital > 0 and capital < 100:
#the gambler's goal is to reach 100 dollars so only betting upto maximum
for a in range(1, min(capital, 100 - capital) + 1):
stakes.append(a)
return stakes
```

**init**() is called when we create a variable of the class Gambler. This function is called the constructor. p=-1 is a default value if no p is provided.

In [5]:

```
random_gambler = Gambler()
print random_gambler.kind
```

Given a certain capital, the function update_allowed_stakes computes all possible stakes. We use a range function that has the form range(a,b+1) to generate a list of values [a, a+1, ..., b]. This is to set the minimal stake to 1 dollar (a = 1). The maximum stake is either the whole capital or how much is needed to reach 100 dollars, whichever is smaller ( min(capital, 100 - capital) ).

Let's create two gamblers - an uninformed gambler and an informed one. Later, we will compare their performances.

In [ ]:

```
gamblers = [Gambler(p), Gambler()]
```

Write a class Coin which has a variable p that represents the probability of turning heads and is passed through a constructor. Set p to a default value of 0.5 in case it is not provided in the input. The class should also have a function def throw(self) which with probability p produces heads (1) and otherwise produces tails (0). Use the function random() from the library random to get a random number between 0 and 1. You can find the description of the function here https://docs.python.org/2/library/random.html. Add your code to Coin.py.

By editing the code in gambling.py, test how much a gambler that is placing random bets loses by calling the function test_gambler with an instance of a coin, 100 attempts. The default starting capital at each attempt is set to 75 and the default heads probability is set to 0.4. Print how much the gambler has lost.

In [ ]:

```
def reward(self, new_capital):
"""
The gambler reaches the goal when the new cpaital is 100
"""
if new_capital == 100:
return 1
else:
return 0
```

Finding the best strategy will require estimating two lists iteratively. The first list is stake_values, which for every possible stake estimates the expected reward. The second list is capital_values, which for every possible initial capital estimates the expected reward for the whole game.

If he know how to estimate stake_values, the optimal strategy is to place the stake which gives the highest expected reward:

In [ ]:

```
def learned_strategy(self, capital):
"""
Informed strategy
"""
# estimate the value of each action for a given state
stake_values = self.stake_values(capital)
#return the action that has the highest value
return numpy.argmax(stake_values) + 1
```

In practice, we don't have to iterate over all possible new capitals. For every possible stake and capital there are only two possible outcomes:

```
- coin turning heads (with probability p), in which case new_capital is capital + stake; and
- coin turning tails (with probability 1 - p), in which case new_capital is capital - stake.
```

So there are only two possible new_capitals. Edit the function stake_values in Gambler.py by replacing TODOs with adequate statements.

In [ ]:

```
def stake_values(self, capital):
# value for each stake is initially unknown
# TODO: initialise stake_values to empty list
#we iterate over possible stakes
possible_stakes = self.update_allowed_stakes(capital)
#TODO: for every possible stake
#TODO: initialise stake value to zero
#TODO: for every possible coin outcome
#TODO using the function self.update_capital() estimate new_capital
#TODO probability of outcome x ( reward(new_capital) + self.capital_values[new_capital])
#TODO: estimate the reward of new_capital using self.reward function
# add it to self.capital_value of new_capital,
# scale the sum with probability of that outcome and add it to the value
#TODO: add value to the list stake_values
return stake_values
```

$$CapitalValue(Capital)=\underset{Stake}{max}(StakeValue(Stake,Capital))$$

But isn't this a chicken-and-egg problem? If we have the value for each capital, we can estimate the value of the stake and vice versa. But how do we start??

Simple! Assume that all capital values are initially zero and iteratively make updates to the stake values and the capital values until the capital values stop changing. This algorithm is called Value Iteration. Replace TODO with an adeqate statement in the function value_iteration in Gambler.py.

In [ ]:

```
def value_iteration(self):
"""
Core update algorithm
"""
# initial capital values
self.capital_values = numpy.zeros(101)
#initial possible starting capital
capitals = range(101)
while True:
Delta = 0
for i in range(101):
# keep the record of old value for each possible starting capital
v = self.capital_values[i]
#keep initial and final dummy state for zero capital and 100 capital - do not update the value
if i == 0 or i == 100:
continue
# what is the value of each allowed stake for this capital
stake_values = self.stake_values(capitals[i])
#TODO: value of the capital is the value of the stake which has the highest value
#keep track of how much the capital values have changed
Delta = max(Delta, math.fabs(v - self.capital_values[i]))
#if the values have not changed a lot then break
if Delta < math.exp(-20):
break
```

Now check which gambler loses less by testing both gamblers in gambling.py.