1. Introduction

1.1. About this library

This library provides several algorithms from the field of machine learning and statistics. A short introduction into statistics is given in the next chapter. The goal of this documentation is to provide a theoretical background additional to the implementation. Main topics covered so far are sampling methods, model fitting and model selection. The model fitting module provides various many different learning algorithms. Where possible, it was tried to stick to a statistics and information theory framework.

Besides the python standard library, the framework mainly uses scipy and numpy. Although not all algorithms rely on them, it’s strongly encouraged to use those libraries.

Other projects that might be useful in what you try to achieve can be found at scikit. Especially noteworthy is the package scikit-learn that covers a lot of machine learning algorithms. For statistical computation, there’s a nice list of projects as StatPy. Noteworthy are also the SciTools package, the MatPlotLib for plotting and the convex optimization package cvxopt.

If you’re looking for more theoretical information, note the used Resources. On the web, many topics are well covered by wikipedia and also the StatSoft Statistics textbook might be useful.

1.1.1. Type notations

It’s important to understand the type signatures. It may help a lot to understand the structure of the code.

Wherever a common type like float or int (or a type that mimics it) is expected, this is noted using the python syntax. Where a list is expected, also the respective python symbol is used. For exampe, [[float]] would be a matrix of floats whereas (float) is a tuple of floats. Lists and tuples are usually exchangable but it’s strongly recommended to use the specified one. The syntax for functions and unspecified types is haskell-like. If the there are no restriction on a type, a character is paced instead. Several types are concatenated using an arrow (->) and the last type in the chain is the return value. Consider the example below:

>>> (Num b) :: [a] -> b -> a

This signature defines a function that takes a list of an arbitrary type and a second parameter of (possibly) another type and returns a value of the same type as the first parameter’s list type. Additionally, if a type is requested to behave like a nummerical type (int or float), this is indicated as above for the type b. Several restrictions are seperated by comma. Besides the nummerical restriction, also Ord (orderd type) may appear. A function with this signature could be as simple as

>>> lambda lst,idx: lst[idx]

Usually, the type symbols are coherent within all methods of a class and often also with related classes (i.e. classes in the same document).

If no type signature is given, the type is completely arbitrary. However, note that this is usually the case for abstract methods and type restrictions may be found in their concrete implementations.

1.2. Statistics formulary

1.2.1. Combinatorics

Draws of objects from an urn. There are n objects and k draws.

  Ordered Unordered
With replacement n^k (n+k-1)!/((n-1)!*k!)
Without replacement n!/(n-k)! n!/((n-k)!*k!)

1.2.2. Random variables and probability mass function

Consider an experiment with several possible outcomes where the outcome depends on some underlying random process. A prominent example is a coin toss where the possible results are head and tail. The set of all the experiment’s outcomes is the sample space \mathcal{S}. Depending on the nature of the experiment, it may be either discrete or continous.

Such an experiment can be expressed through a random variable X. A random variable can be viewed as a function that assigns a value (often a real number) to any element of the sample space \mathcal{S}.

X: \mathcal{S} \rightarrow \mathcal{A}

If the sample space \mathcal{S} is continous, then so is the random variable X. If \mathcal{S} is countable, then X is discrete. Note that for the space of a variable, always the calligraphic sign (e.g. \mathcal{S}, \mathcal{X}, \mathcal{Y}) is used.

The probability mass function P: \mathcal{A} \rightarrow [0,1] gives the probability of seeing a specific outcome of a random variable. If the random variable is continous, this probability has to be zero, so P(X=x) = 0. The probability of seeing any possible outcome must be one, hence

\sum\limits_{x \in \mathcal{S}} P(X=x) = 1

\int\limits_{x \in \mathcal{S}} P(X=x) dx = 1

Usually, the random variable is denoted with an uppercase letter (X) whereas a lower case letter (x) is used for a specific value of the variable. Also, in order to make mathematical terms more readable, the following notation is used:

P(x) := P(X=x)

Given two random variables X,Y, the probability that the respective outcomes are x and y are observed at the same time is called the joint probability P(x,y) := P(X=x,Y=y).

Two random variables are independent, if and only if

P(x,y) = P(x) P(y)

holds. Independence means that the outcome of one random variable does not influence the outcome of the other one. If two random variables are independent, one can easily find the probability of one of the two variables by summing or integration over the other one, i.e. P(x) = \sum\limits_y P(x,y) if the variables are discrete.

The conditional probability P(x|y) is the probability of seeing a certain outcome x, given that the outcome of the other random variable was already observed to be y.

P(x|y) = P(X=x \, | \, Y=y) := \frac{P(x,y)}{P(y)}

For independent variables, it can be seen that P(x|y) = P(x), which intuitively makes sense as y doesn’t influence x.

The law of total probability, also called marginalization, states the following:

P(x) = \sum\limits_{y \in \mathcal{Y}} P(x|y) P(y)

P(x) = \int\limits_{y \in \mathcal{Y}} P(x|y) P(y) dy

1.2.3. Bayes

The Bayes theorem states the following:

P(x|y) = \frac{ P(y|x) P(x) }{ P(y) }

The term P(x) is the prior probability, as its independent of y and can thus be interpreted as the probability of an outcome of X where no information about the second variable Y is available. In this view, the conditional probability P(x|y) is the posterior, so the probability after some information was already observed. The denominator P(y) is for normalization, hence also called normalizer.

Using the law of total probability, the theorem can be re-written into the following form:

P(x|y) = \frac{ P(y|x) P(x) }{ \sum\limits_{x' \in \mathcal{X}} P(y|x') P(x') }

If there are three random variables, the probability of the outcome x, given that the two outcomes y,z are observed is

P(x|y,z) = \frac{ P(y|x,z) P(x|z) }{ P(y|z) }

Two variables X,Y are conditionally independent, iff

P(x,y|z) = P(x|z) P(y|z)

given a third random variable Z. It follows that

P(x|z) = P(x|y,z)

P(y|z) = P(y|x,z)

In robotics, the term Belief is sometime used. Let u_1,z_1, \dots, u_t,z_t be a sequence of measurements z_t and actions u_t up to time t. The belief expressed the probability of being in the current state x_t:

\text{Bel}(x_t) = P(x_t \, | \, u_1, z_1, \dots, u_t, z_t)

1.2.4. Probability distributions

The term probability distribution is used to describe the collection of the probabilities for all possible outcomes. When talking about a probability distribution, usally the probability mass or density functions are referred to.

Remember the probability mass function (pmf) P(X=x), already defined above. It expresses how probable a certain outcome is and it holds that

\sum\limits_{x in \mathcal{X}} P(X=x) = 1

In the discrete case, also the probability of several outcomes can be computed easily by

P(X \in A) = \sum\limits_{x \in A} P(X=x)

For discrete P(X), the probability of any element is non-negative and strictly positive for at least one x \in \mathcal{X}. On the other hand, if P(X) is continous, the probability of one exact value goes to zero, hence P(X=x) = 0 \, \forall x. In consequence, for continous functions we define

P(a \leq X \leq b) = \int\limits_a^b p(x) dx

with p(x) the probability density function (pdf). Analogous to the pmf, the probability density function describes how likely a certain outcome is for the random varaible. The pdf is also non-negative all probabilities sums up to one, \int\limits_{-\infty}^{\infty} p(x) dx = 1.

Further, we can define the cumulative distribution function (cdf):

F(x) := P(X \leq x) = \int\limits_{-\infty}^x f(t) dt

which expresses the probability that the outcome of the experiment will be no larger than x. Note the limits \lim\limits_{x \rightarrow -\infty} F(x) = 0 and \lim\limits_{x \rightarrow \infty} F(x) = 1. From this definition, it’s easily seen that

F(b) - F(a) = P(a \leq X \leq b) = \int\limits_a^b f(x) dx

If P(X) isn’t continous at all places x, the probability at the discontinuity is

P(X=x) = F(x) - \lim\limits_{y \rightarrow x-} F(y)

which of course again goes to zero if P(X) is continuous at x.

For sampling, the inverse cdf would be important. Unfortunately, a closed form representation is often not available. The inverse cdf is defined as

F^{-1}(y) := \inf\limits_{x} F(x) \geq y \qquad y \in [0,1]

In other words this means that for a given y, find the value x such that F(x) = y.

There are several well known distributions defined in the section Distribution Wrappers and also in the scipy library.

1.2.5. Statistics

Consider a set of data points. Likely, one wants to give some quantitative description of their shape. Also, one would like to be able to give some characterisations of known and well defined probability distributions.

For this, let’s first define the Moment

\mu_k := E[ X^k ]

using the definition of the Expected value (for discrete and continous random variables)

E[X] = \sum\limits_{x} x P(x)

E[X] = \int\limits_{x} x f(x) dx

Note that f(x) is the probability density function of X. The expected value is sometimes referenced as mean and misleadingly associated with the arithmetic mean. Of course this is only true, if the probability of the observations is uniform (all observations are equally likely) which is often the case for measured data (then the probability distribution is expressed by repetition of measurements).

The expected value is linear, a fact that is formally noted by the following equation:

E[aX + bY + c] = a \, E[X] + b \, E[Y] + c

If the two random variables X,Y are independent, it also holds that

E[X \, Y] = E[X] \, E[Y]

The Central Moment is the moment about the expected value, thus

\mu_k := E[ (X-E[X])^k ]

From its definition, one can easily see that the zero’th central moment is one and the first central moment is zero.

\mu_0 = 1

\mu_1 = 0

The second central moment is the Variance

\text{Var}[X] &:=& E[ (X-E[X])^2 ] \\
       & =& \sum\limits_{x} (x-\mu)^2 P(x) \\
       & =& \int\limits_{x} (x-\mu)^2 f(x) dx

again f(x) is the probability density function and \mu = E[X] the expected value. The equations show the definition for a discrete and continous random variable, respectively. An alternative and often more appropriate formulation is

\text{Var}[X] = E[X^2] - (E[X])^2

The variance is pseudo-linear, which means the following:

\text{Var}[aX+b] = a^2 \text{Var}[X]

A generalisation of the variance to multiple random variables yields the Covariance, which is defined analogously

\text{Cov}[X,Y] = E[ \left( X-E[X] \right) \left( Y-E[Y] \right) ]

For the sake of completeness but without further comment, here are some properties of the covariance:

\text{Cov}[X,X] &=& \text{Var}[X] \\
\text{Cov}[X,Y] &=& E[X \, Y] - E[X] \, E[Y] \\
\text{Cov}[X,Y] &=& \text{Cov}[Y,X] \\
\text{Cov}[aX + b, Y] &=& a \text{Cov}[X,Y] \\
\text{Cov}[X+Y,Z] &=& \text{Cov}[X,Z] + \text{Cov}[Y,Z]