Contents

This library provides several algorithms from the fields of machine learning and statistics. A short introduction to statistics is given in the next chapter. The goal of this documentation is to provide theoretical background in addition to the implementation. The main topics covered so far are sampling methods, model fitting and model selection. The model fitting module provides many different learning algorithms. Where possible, the implementation sticks to a statistics and information theory framework.

Besides the Python standard library, the framework mainly uses scipy and numpy. Although not all algorithms rely on them, you are strongly encouraged to use these libraries.

Other projects that might be useful for what you are trying to achieve can be found at scikit. Especially noteworthy is the scikit-learn package, which covers a lot of machine learning algorithms. For statistical computation, there is a nice list of projects at StatPy. Also noteworthy are the SciTools package, matplotlib for plotting and the convex optimization package cvxopt.

If you’re looking for more theoretical information, have a look at the *Resources* used. On the web, many topics are well covered by Wikipedia, and the StatSoft Statistics textbook might also be useful.

It’s important to understand the type signatures used in this documentation; they help a lot in understanding the structure of the code.

Wherever a common type like float or int (or a type that mimics it) is expected, this is noted using the Python syntax. Where a list is expected, the respective Python symbol is used as well. For example, [[float]] would be a matrix of floats whereas (float) is a tuple of floats. Lists and tuples are usually exchangeable, but it’s strongly recommended to use the specified one. The syntax for functions and unspecified types is Haskell-like. If there are no restrictions on a type, a single character is used instead. Several types are concatenated using an arrow (->) and the last type in the chain is the return value. Consider the example below:

```
>>> (Num b) :: [a] -> b -> a
```

This signature defines a function that takes a list of an arbitrary type and a second parameter of (possibly) another type, and returns a value of the same type as the elements of the first parameter’s list. Additionally, if a type is required to behave like a numerical type (int or float), this is indicated as above for the type b. Several restrictions are separated by commas. Besides the numerical restriction, Ord (ordered type) may also appear. A function with this signature could be as simple as

```
>>> lambda lst, idx: lst[idx]
```

Usually, the type symbols are consistent across all methods of a class and often also across related classes (i.e. classes in the same document).

If no type signature is given, the type is completely arbitrary. However, note that this is usually the case for abstract methods and type restrictions may be found in their concrete implementations.

The table below gives the number of possible ways to draw k objects from an urn containing n objects.

|                     | Ordered   | Unordered            |
|---------------------|-----------|----------------------|
| With replacement    | n^k       | (n+k-1)!/((n-1)!*k!) |
| Without replacement | n!/(n-k)! | n!/((n-k)!*k!)       |
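These counts can be checked directly with the Python standard library; a small sketch (the values of n and k are arbitrary example choices, and `math.perm`/`math.comb` require Python 3.8 or newer):

```
import math

n, k = 5, 3

ordered_with_repl = n ** k                     # n^k
ordered_without_repl = math.perm(n, k)         # n!/(n-k)!
unordered_without_repl = math.comb(n, k)       # n!/((n-k)!*k!)
unordered_with_repl = math.comb(n + k - 1, k)  # (n+k-1)!/((n-1)!*k!)

print(ordered_with_repl, ordered_without_repl,
      unordered_without_repl, unordered_with_repl)
```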

Consider an experiment with several possible outcomes, where the outcome depends on some underlying random
process. A prominent example is a coin toss where the possible results are heads and tails.
The set of all the experiment’s outcomes is the **sample space** $\Omega$.
Depending on the nature of the experiment, it may be either discrete or continuous.

Such an experiment can be expressed through a **random variable** $X$.
A random variable can be viewed as a function that assigns a value (often a real number) to any
element of the sample space $\Omega$.

If the sample space is continuous, then so is the random variable $X$. If the sample space is countable, then $X$ is discrete. Note that for the space of a variable, the calligraphic sign (e.g. $\mathcal{X}$) is always used.

The **probability mass function** $P(x)$ gives the probability of
seeing a specific outcome $x$ of a random variable. If the random variable is continuous, this probability has
to be zero, so $P(X = x) = 0$. The probability of seeing any possible outcome must be one, hence

$$\sum_{x \in \mathcal{X}} P(X = x) = 1.$$
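As a tiny illustration (the fair six-sided die is an assumed example), the pmf assigns 1/6 to each face and the probabilities sum to one:

```
pmf = {outcome: 1.0 / 6.0 for outcome in range(1, 7)}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # probabilities sum to one
print(pmf[3])                                # P(X = 3) = 1/6
```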

Usually, the random variable is denoted with an uppercase letter ($X$) whereas a lowercase letter ($x$) is used for a specific value of the variable. Also, in order to make mathematical terms more readable, the following shorthand notation is used:

$$P(x) := P(X = x)$$

Given two random variables $X$ and $Y$, the probability that the respective outcomes $x$ and $y$
are observed at the same time is called the **joint probability** $P(x, y) = P(X = x, Y = y)$.

Two random variables are **independent** if and only if

$$P(x, y) = P(x)\,P(y)$$

holds. Independence means that the outcome of one random variable does not influence the outcome of the other one. If two random variables are independent, one can easily find the probability of one of the two variables by summing or integrating over the other one, i.e. $P(x) = \sum_{y \in \mathcal{Y}} P(x, y)$ if the variables are discrete.
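A minimal numerical sketch, assuming two made-up binary variables that are independent by construction:

```
import numpy as np

p_x = np.array([0.3, 0.7])
p_y = np.array([0.6, 0.4])
joint = np.outer(p_x, p_y)           # P(x, y) = P(x) * P(y) under independence

marginal_x = joint.sum(axis=1)       # summing over y recovers P(x)
print(np.allclose(marginal_x, p_x))  # True
```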

The **conditional probability** $P(x \mid y)$ is the probability of seeing a certain outcome $x$,
given that the outcome of the other random variable was already observed to be $y$:

$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$

For independent variables, it can be seen that $P(x \mid y) = P(x)$, which intuitively makes sense as $y$ doesn’t influence $x$.

The law of **total probability**, also called **marginalization**, states the following:

$$P(x) = \sum_{y \in \mathcal{Y}} P(x, y) = \sum_{y \in \mathcal{Y}} P(x \mid y)\,P(y)$$

The **Bayes theorem** states the following:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

The term $P(x)$ is the prior probability, as it is independent of $y$ and can thus be interpreted as the probability of an outcome of $X$ where no information about the second variable is available. In this view, the conditional probability $P(x \mid y)$ is the posterior, i.e. the probability after some information was already observed. The denominator $P(y)$ is for normalization, hence also called the normalizer.

Using the law of total probability, the theorem can be re-written into the following form:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{\sum_{x' \in \mathcal{X}} P(y \mid x')\,P(x')}$$
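A short numerical sketch of the theorem, using made-up numbers for a diagnostic test (all values are assumed, purely for illustration):

```
p_disease = 0.01               # prior P(x)
p_pos_given_disease = 0.95     # likelihood P(y | x)
p_pos_given_healthy = 0.10     # likelihood P(y | not x)

# normalizer P(y) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# posterior P(x | y) by the Bayes theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)     # roughly 0.088
```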

If there are three random variables, the probability of the outcome $x$, given that the two outcomes $y$ and $z$ are observed, is

$$P(x \mid y, z) = \frac{P(y \mid x, z)\,P(x \mid z)}{P(y \mid z)}$$

Two variables $X$ and $Y$ are **conditionally independent**, iff

$$P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$$

given a third random variable $Z$. It follows that

$$P(x \mid y, z) = P(x \mid z) \qquad \text{and} \qquad P(y \mid x, z) = P(y \mid z).$$

In robotics, the term **belief** is sometimes used. Let $z_{1:t}, u_{1:t}$ be the sequence
of measurements and actions up to time $t$. The
belief expresses the probability of being in the current state $x_t$:

$$bel(x_t) = P(x_t \mid z_{1:t}, u_{1:t})$$

The term **probability distribution** is used to describe the collection of the probabilities
for all possible outcomes. When talking about a probability distribution, usually the probability
mass or density functions are referred to.

Remember the **probability mass function** (pmf) $P(x)$, already defined above. It expresses how
probable a certain outcome is, and it holds that

$$0 \le P(x) \le 1 \qquad \text{and} \qquad \sum_{x \in \mathcal{X}} P(x) = 1.$$

In the discrete case, the probability of a set of several outcomes $A \subseteq \mathcal{X}$ can also be computed easily by

$$P(X \in A) = \sum_{x \in A} P(x).$$

For discrete $X$, the probability of any element $x$ is non-negative and strictly positive for at least one $x$. On the other hand, if $X$ is continuous, the probability of one exact value goes to zero, hence $P(X = x) = 0$. In consequence, for continuous variables we define

$$P(a \le X \le b) = \int_a^b p(x)\,dx$$

with $p(x)$ the **probability density function** (pdf). Analogous to the pmf, the
probability density function describes how likely a certain outcome is for the
random variable. The pdf is also non-negative and all probabilities sum up to one,
$\int_{-\infty}^{\infty} p(x)\,dx = 1$.

Further, we can define the **cumulative distribution function** (cdf):

$$F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$$

which expresses the probability that the outcome of the experiment will be no larger than $x$. Note the limits $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. From this definition, it’s easily seen that

$$P(a < X \le b) = F(b) - F(a).$$
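For instance, $P(a < X \le b)$ for a standard normal variable can be evaluated like this (a sketch assuming scipy is available; the interval is an arbitrary example):

```
from scipy import stats

a, b = -1.0, 1.0
p = stats.norm.cdf(b) - stats.norm.cdf(a)  # F(b) - F(a)
print(p)                                   # roughly 0.683
```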

If $F$ isn’t continuous at all places $x$, the probability at a discontinuity $x_0$ is

$$P(X = x_0) = F(x_0) - \lim_{x \to x_0^-} F(x)$$

which of course again goes to zero if $F$ is continuous at $x_0$.

For sampling, the **inverse cdf** is important. Unfortunately, a closed-form
representation is often not available. The inverse cdf is defined as

$$F^{-1}(p) = \inf\{x \mid F(x) \ge p\}, \qquad p \in [0, 1].$$

In other words, for a given probability $p$, find the value $x$ such that $F(x) = p$.
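This is the basis of inverse-transform sampling: draw a uniform value and push it through the inverse cdf. A minimal sketch, assuming scipy is available (`ppf` is scipy’s name for the inverse cdf):

```
import numpy as np
from scipy import stats

u = np.random.uniform(size=5)  # uniform draws in [0, 1)
x = stats.norm.ppf(u)          # inverse cdf of the standard normal
print(x)                       # x is distributed according to the standard normal
```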

There are several well-known distributions defined in the section *Distribution Wrappers* and also in the scipy library.

Consider a set of data points. One likely wants to give some quantitative description of their shape. Also, one would like to be able to characterise known and well-defined probability distributions.

For this, let’s first define the $n$-th **Moment**

$$\mu_n' := E[X^n]$$

using the definition of the **Expected value** (for discrete and continuous random variables, respectively)

$$E[X] = \sum_{x \in \mathcal{X}} x\,P(x), \qquad E[X] = \int_{-\infty}^{\infty} x\,p(x)\,dx.$$

Note that $p(x)$ is the probability density function of $X$. The expected value is sometimes referred to as the mean and misleadingly associated with the arithmetic mean. Of course, this is only true if the probability of the observations is uniform (all observations are equally likely), which is often the case for measured data (the probability distribution is then expressed by the repetition of measurements).
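A small numerical sketch (the pmf values are assumed examples): the expected value weights each outcome by its probability, and it coincides with the arithmetic mean only when all observations are equally likely:

```
import numpy as np

values = np.array([1.0, 2.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])     # a pmf, sums to one
expected = np.sum(values * probs)     # E[X] = sum_x x * P(x) = 2.1
print(expected)

observations = np.array([1.0, 2.0, 2.0, 3.0])
print(observations.mean())            # arithmetic mean = uniform weights
```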

The expected value is linear, a fact that is formally noted by the following equation:

$$E[aX + bY + c] = a\,E[X] + b\,E[Y] + c$$

If the two random variables are independent, it also holds that

$$E[XY] = E[X]\,E[Y].$$

The $n$-th **Central Moment** is the moment about the expected value, thus

$$\mu_n = E\big[(X - E[X])^n\big].$$

From its definition, one can easily see that the zeroth central moment is one and the first central moment is zero.

The second central moment is the **Variance**

$$\operatorname{Var}[X] = \sum_{x \in \mathcal{X}} (x - E[X])^2\,P(x)$$

$$\operatorname{Var}[X] = \int_{-\infty}^{\infty} (x - E[X])^2\,p(x)\,dx$$

where again $p(x)$ is the probability density function and $E[X]$ the expected value. The equations show the definition for a discrete and a continuous random variable, respectively. An alternative and often more appropriate formulation is

$$\operatorname{Var}[X] = E[X^2] - E[X]^2.$$

The variance is pseudo-linear, which means the following:

$$\operatorname{Var}[aX + b] = a^2\,\operatorname{Var}[X]$$

A generalisation of the variance to multiple random variables yields the **Covariance**, which
is defined analogously:

$$\operatorname{Cov}[X, Y] = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$
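Both identities are easy to check numerically; a sketch with synthetic data (the data-generating process here is an arbitrary assumption):

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

var_x = np.mean(x**2) - np.mean(x)**2              # Var[X] = E[X^2] - E[X]^2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)  # Cov[X, Y]

# compare with numpy's built-ins (population convention)
print(var_x, np.var(x))
print(cov_xy, np.cov(x, y, bias=True)[0, 1])
```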

For the sake of completeness but without further comment, here are some properties of the covariance:

$$\operatorname{Cov}[X, X] = \operatorname{Var}[X]$$

$$\operatorname{Cov}[X, Y] = \operatorname{Cov}[Y, X]$$

$$\operatorname{Cov}[aX + b, cY + d] = a\,c\,\operatorname{Cov}[X, Y]$$

$$\operatorname{Cov}[X, Y] = 0 \quad \text{if } X \text{ and } Y \text{ are independent}$$