Suppose you know there are 10 balls in an urn, and each ball is either red or blue. There are 11 different possible models for this situation:
- M0: 0 red, 10 blue
- M1: 1 red, 9 blue
- ...
- M10: 10 red, 0 blue
Initially we do not know which model we are in, so a reasonable thing to do is assign equal probability to every model. This is the maximum entropy principle, and we use it to set up our prior probability distribution.
Later on, as we learn new information, we will update our list of 11 probabilities to more accurately reflect what we have learned, honing in on the more plausible models.
```python
size = 10  # number of balls in the urn

# uniform prior: (number of red balls, probability of that model)
m = [(i, 1/(size+1)) for i in range(size+1)]
```

Let an event be that we pull a ball out, check whether it is red or blue, then put it back in. We will call these events draws.
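If we wanted to simulate a single draw under a particular model, a minimal sketch might look like this (the `draw` helper is my own addition for illustration, not used in the analysis below):

```python
import random

def draw(i):
    """Simulate one draw from model M_i: True means we pulled a red ball."""
    return random.random() < i / size

draw(7)  # red with probability 7/10 under M7
```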
What is the probability that a draw comes up red? 1/2, right? Under the uniform prior this turns out to be true, but later we will be leaning more towards some models than others. So how do we calculate probabilities in general?
- $P(\text{red} | M_0) = 0/10$
- $P(\text{red} | M_1) = 1/10$
- ...
- $P(\text{red} | M_{10}) = 10/10$
We know the probability of each model, so we can sum over the models (the law of total probability):
$P(\text{red}) = \sum_{i} P(M_i) P(\text{red} | M_i)$
```python
def p_red():
    # law of total probability: sum over models of P(M_i) * P(red | M_i)
    return sum([p * i/size for (i, p) in m])

def p_blue():
    return 1 - p_red()

p_red()
```

```
0.5
```

With the uniform prior this checks out by hand: $\sum_{i=0}^{10} \frac{1}{11} \cdot \frac{i}{10} = \frac{55}{110} = \frac{1}{2}$.

A mathematician, a physicist, and an engineer are riding a train through Scotland.
The engineer looks out the window, sees a black sheep, and exclaims, "Hey! They've got black sheep in Scotland!"
The physicist looks out the window and corrects the engineer, "Strictly speaking, all we know is that there's at least one black sheep in Scotland."
The mathematician looks out the window and corrects the physicist, "Strictly speaking, all we know is that at least one side of one sheep in Scotland is black."
Now we can get to the heart of the problem.
Suppose we perform a draw (we take a ball out, look at it, and put it back). We learn a couple of things. If the ball is red, we learn that there is at least one red ball in the urn. This is significant: it means we can completely eliminate model M0. In other words, we can assign it probability 0. What probabilities should we assign to the rest of the models? 1/10 each seems like a good option. But we did pull a red ball, so perhaps it would be reasonable to lean slightly towards the red-heavy models.
We can use Bayes' theorem to work out the posterior probability of each model.
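Concretely, for each model $M_i$:

$P(M_i | \text{red}) = \frac{P(\text{red} | M_i) \, P(M_i)}{P(\text{red})}$

and the same with blue in place of red.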
```python
def p_model_given_red(i):
    # Bayes' theorem: P(M_i | red) = P(red | M_i) * P(M_i) / P(red)
    return i/size * m[i][1] / p_red()

def p_model_given_blue(i):
    return (1 - i/size) * m[i][1] / p_blue()

[p_model_given_red(i) for i in range(size+1)]
```
```
[0.0,
 0.018181818181818184,
 0.03636363636363637,
 0.05454545454545454,
 0.07272727272727274,
 0.09090909090909091,
 0.10909090909090909,
 0.12727272727272726,
 0.14545454545454548,
 0.16363636363636364,
 0.18181818181818182]
```

Here is a graph of the new probability distribution:

![Posterior after drawing one red ball](RED.png)
You can see that 0 reds now has 0 probability, and all reds has become the most likely model.
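As a quick sanity check (using only the functions above), the new probabilities should still sum to 1:

```python
sum(p_model_given_red(i) for i in range(size+1))  # 1.0, up to float error
```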
What if we drew a red then a blue?
```python
def update_given_red():
    return [(i, p_model_given_red(i)) for i in range(size+1)]

def update_given_blue():
    return [(i, p_model_given_blue(i)) for i in range(size+1)]

# each update replaces the prior with the posterior for the observed draw
m = update_given_red()
m = update_given_blue()
m
```
```
[(0, 0.0),
 (1, 0.05454545454545454),
 (2, 0.09696969696969697),
 (3, 0.12727272727272726),
 (4, 0.14545454545454545),
 (5, 0.15151515151515152),
 (6, 0.14545454545454545),
 (7, 0.12727272727272726),
 (8, 0.09696969696969697),
 (9, 0.05454545454545454),
 (10, 0.0)]
```

Here is the graph:

![Posterior after drawing a red then a blue](RED-BLUE.png)
As so often happens in statistics, a bell-shaped curve starts to appear.
This was all inspired by a question: what if we drew a red ball 6 times in a row, and then our friend came along and drew a blue? How surprised would we be? Should we accuse them of cheating?
```python
# start again from the uniform prior
m = [(i, 1/(size+1)) for i in range(size+1)]

# observe six red draws in a row
m = update_given_red()
m = update_given_red()
m = update_given_red()
m = update_given_red()
m = update_given_red()
m = update_given_red()

m
```
```
[(0, 0.0),
 (1, 5.054576792921572e-07),
 (2, 3.234929147469806e-05),
 (3, 0.00036847864820398234),
 (4, 0.002070354654380676),
 (5, 0.007897776238939953),
 (6, 0.02358263348505487),
 (7, 0.05946659051104296),
 (8, 0.13250269788036326),
 (9, 0.26862093454070324),
 (10, 0.505457679292157)]
```
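Calling `update_given_red()` six times by hand is a bit clumsy; here is a minimal sketch of the same replay driven by a list of observed draws (the `replay` helper is my own addition, built only from the functions above):

```python
def replay(draws):
    """Reset to the uniform prior, then update once per observed draw."""
    global m
    m = [(i, 1/(size+1)) for i in range(size+1)]
    for d in draws:
        m = update_given_red() if d == "red" else update_given_blue()
    return m

replay(["red"] * 6)  # reproduces the posterior above
```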
```python
p_blue()
```

```
0.08611103388841013
```

I'm getting a result of about 8%, so a bit less than 1/10. Our friend drawing a blue is believable.
But if we were to tip out the urn and find it had only 1 red ball in it, that would be a less-than-one-in-a-million outcome under our posterior, and we would be very surprised indeed.
![Posterior after six red draws](red-red-red-red-red-red.png)
This was a very simple example of the general concept of an agent performing Bayesian reasoning under uncertainty.
Probabilities are fundamentally about an agent's beliefs, based on its personal model of the world, which is formed by the information it has received. Probability and entropy are tightly connected.
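One way to make that connection concrete, as a sketch: the surprise of an outcome can be measured as its information content, $-\log_2 p$, in bits.

```python
import math

def surprisal_bits(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

surprisal_bits(p_blue())  # seeing a blue draw after six reds: ~3.5 bits
surprisal_bits(m[1][1])   # the urn holding just 1 red ball: ~21 bits
```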
The key step that enabled us to reason intelligently here was applying Bayes' theorem to go from forward reasoning (the probability of a draw given a model) to backward reasoning (the probability of a model given a draw). Many people reading this will already be familiar with Bayes' theorem, but what I want to stress is that we applied it to work out the probabilities of models of the world.
An agent using this type of Bayesian reasoning is able to admit that it has only partial knowledge of the universe, but still do its best based on what it has. Another fundamental concept is that, for anything to get started in the first place, we needed a prior probability distribution.
These concepts apply much more generally to any agent that aims to operate intelligently without perfect information. For more, I recommend the book *Information Theory, Inference, and Learning Algorithms* by David J.C. MacKay.


