.. _intro:

Introduction
************

.. contents::

.. _intro-about:

About this library
==================

This library provides several algorithms from the field of machine learning
and statistics. A short introduction to statistics is given in the next
chapter. The goal of this documentation is to provide a theoretical background
in addition to the implementation. The main topics covered so far are sampling
methods, model fitting and model selection. The model fitting module provides
many different learning algorithms. Where possible, we tried to stick to a
statistics and information theory framework.

Besides the `python standard library `_, the framework mainly uses `scipy `_
and `numpy `_. Although not all algorithms rely on them, using those libraries
is strongly encouraged.

Other projects that might be useful for what you are trying to achieve can be
found at `scikit `_. Especially noteworthy is the package `scikit-learn `_,
which covers a lot of machine learning algorithms. For statistical
computation, there is a nice list of projects at `StatPy `_. Also noteworthy
are the `SciTools `_ package, `MatPlotLib `_ for plotting and the convex
optimization package `cvxopt `_.

If you are looking for more theoretical information, see the listed
:ref:`resources`. On the web, many topics are well covered by `wikipedia `_,
and the `StatSoft Statistics textbook `_ might also be useful.

.. _intro-type:

Type notations
--------------

It is important to understand the type signatures; they help a lot in
understanding the structure of the code. Wherever a common type like float or
int (or a type that mimics it) is expected, this is noted using the python
syntax. Where a list is expected, the respective python symbol is used as
well. For example, [[float]] denotes a matrix of floats, whereas (float) is a
tuple of floats. Lists and tuples are usually exchangeable, but it is strongly
recommended to use the specified one.

The syntax for functions and unspecified types is haskell-like.
If there are no restrictions on a type, a single character is placed instead.
Several types are concatenated using an arrow (->) and the last type in the
chain is the return value. Consider the example below:

>>> (Num b) :: [a] -> b -> a

This signature defines a function that takes a list of an arbitrary type and a
second parameter of (possibly) another type, and returns a value of the same
type as the elements of the first parameter's list. Additionally, if a type is
required to behave like a numerical type (int or float), this is indicated as
above for the type b. Several restrictions are separated by commas. Besides
the numerical restriction, Ord (ordered type) may also appear. A function with
this signature could be as simple as

>>> lambda lst, idx: lst[idx]

Usually, the type symbols are coherent within all methods of a class and often
also with related classes (i.e. classes in the same document). If no type
signature is given, the type is completely arbitrary. However, note that this
is usually the case for abstract methods, and type restrictions may be found
in their concrete implementations.

.. _intro-statistics:

Statistics formulary
====================

.. _intro-combinatorics:

Combinatorics
-------------

Draws of objects from an urn. There are n objects and k draws.

+---------------------+-----------+----------------------+
|                     | Ordered   | Unordered            |
+---------------------+-----------+----------------------+
| With replacement    | n^k       | (n+k-1)!/((n-1)!*k!) |
+---------------------+-----------+----------------------+
| Without replacement | n!/(n-k)! | n!/((n-k)!*k!)       |
+---------------------+-----------+----------------------+

.. _intro-rv:

Random variables and probability mass function
----------------------------------------------

Consider an experiment with several possible outcomes where the outcome
depends on some underlying random process. A prominent example is a coin toss
where the possible results are head and tail.
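The four counts in the table above can be computed directly with the python
standard library. A minimal sketch (the function names are for illustration
only, they are not part of this library):

```python
# The four urn-draw counts for n objects and k draws, matching the table.
from math import comb, perm

def ordered_with_replacement(n, k):
    return n ** k                      # n^k

def unordered_with_replacement(n, k):
    return comb(n + k - 1, k)          # (n+k-1)! / ((n-1)! * k!)

def ordered_without_replacement(n, k):
    return perm(n, k)                  # n! / (n-k)!

def unordered_without_replacement(n, k):
    return comb(n, k)                  # n! / ((n-k)! * k!)

# Example: n = 4 objects, k = 2 draws
print(ordered_with_replacement(4, 2))      # 16
print(unordered_with_replacement(4, 2))    # 10
print(ordered_without_replacement(4, 2))   # 12
print(unordered_without_replacement(4, 2)) # 6
```

Note that ``math.comb`` and ``math.perm`` require python 3.8 or newer.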
The set of all the experiment's outcomes is the **sample space**
:math:`\mathcal{S}`. Depending on the nature of the experiment, it may be
either discrete or continuous.

Such an experiment can be expressed through a **random variable** :math:`X`. A
random variable can be viewed as a function that assigns a value (often a real
number) to each element of the sample space :math:`\mathcal{S}`.

.. math::

   X: \mathcal{S} \rightarrow \mathcal{A}

If the sample space :math:`\mathcal{S}` is continuous, then so is the random
variable :math:`X`. If :math:`\mathcal{S}` is countable, then :math:`X` is
discrete. Note that the space of a variable is always written with a
calligraphic letter (e.g. :math:`\mathcal{S}, \mathcal{X}, \mathcal{Y}`).

The **probability mass function** :math:`P: \mathcal{A} \rightarrow [0,1]`
gives the probability of seeing a specific outcome of a random variable. If
the random variable is continuous, this probability has to be zero, so
:math:`P(X=x) = 0`. The probability of seeing any possible outcome must be
one; hence, for the discrete and the continuous case respectively,

.. math::

   \sum\limits_{x \in \mathcal{S}} P(X=x) = 1

   \int\limits_{x \in \mathcal{S}} P(X=x) \, dx = 1

Usually, a random variable is denoted with an uppercase letter (:math:`X`)
whereas a lowercase letter (:math:`x`) is used for a specific value of the
variable. Also, in order to make mathematical terms more readable, the
following notation is used:

.. math::

   P(x) := P(X=x)

Given two random variables :math:`X,Y`, the probability that the respective
outcomes :math:`x` and :math:`y` are observed at the same time is called the
**joint probability** :math:`P(x,y) := P(X=x,Y=y)`.

.. _intro-independence:

Two random variables are **independent** if and only if

.. math::

   P(x,y) = P(x) P(y)

holds. Independence means that the outcome of one random variable does not
influence the outcome of the other one.
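As a sketch, independence of two discrete random variables can be checked by
comparing each joint probability with the product of the marginals. The coin
distribution below is a made-up example:

```python
# Check P(x, y) == P(x) * P(y) for all outcomes of a discrete joint
# distribution. The joint table models two independent fair coins.
from itertools import product

joint = {  # P(X=x, Y=y)
    ('H', 'H'): 0.25, ('H', 'T'): 0.25,
    ('T', 'H'): 0.25, ('T', 'T'): 0.25,
}

def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    p = {}
    for outcome, prob in joint.items():
        key = outcome[axis]
        p[key] = p.get(key, 0.0) + prob
    return p

def independent(joint, tol=1e-12):
    """True iff P(x, y) == P(x) * P(y) holds for every pair (x, y)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(px, py))

print(independent(joint))  # True
```

A fully dependent distribution such as ``{('H', 'H'): 0.5, ('T', 'T'): 0.5}``
would return ``False``.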
Given the joint distribution of two random variables, the probability of one
of them alone is easily found by summing or integrating over the other one,
i.e. :math:`P(x) = \sum\limits_y P(x,y)` if the variables are discrete.

The **conditional probability** :math:`P(x|y)` is the probability of seeing a
certain outcome :math:`x`, given that the outcome of the other random variable
was already observed to be :math:`y`.

.. math::

   P(x|y) = P(X=x \, | \, Y=y) := \frac{P(x,y)}{P(y)}

For independent variables, it can be seen that :math:`P(x|y) = P(x)`, which
intuitively makes sense as :math:`y` does not influence :math:`x`.

The law of **total probability**, also called **marginalization**, states the
following (for the discrete and the continuous case, respectively):

.. math::

   P(x) = \sum\limits_{y \in \mathcal{Y}} P(x|y) P(y)

   P(x) = \int\limits_{y \in \mathcal{Y}} P(x|y) P(y) \, dy

.. _intro-bayes:

Bayes
-----

The **Bayes theorem** states the following:

.. math::

   P(x|y) = \frac{ P(y|x) P(x) }{ P(y) }

The term :math:`P(x)` is the prior probability: as it is independent of
:math:`y`, it can be interpreted as the probability of an outcome of :math:`X`
when no information about the second variable :math:`Y` is available. In this
view, the conditional probability :math:`P(x|y)` is the posterior, i.e. the
probability after some information has been observed. The denominator
:math:`P(y)` serves for normalization and is hence also called the normalizer.
Using the law of total probability, the theorem can be rewritten into the
following form:

.. math::

   P(x|y) = \frac{ P(y|x) P(x) }{ \sum\limits_{x' \in \mathcal{X}} P(y|x') P(x') }

If there are three random variables, the probability of the outcome :math:`x`,
given that the two outcomes :math:`y,z` are observed, is

.. math::

   P(x|y,z) = \frac{ P(y|x,z) P(x|z) }{ P(y|z) }

Two variables :math:`X,Y` are **conditionally independent** given a third
random variable :math:`Z`, iff

.. math::

   P(x,y|z) = P(x|z) P(y|z)

It follows that

.. math::

   P(x|y,z) = P(x|z) \qquad P(y|x,z) = P(y|z)

In robotics, the term **Belief** is sometimes used. Let :math:`u_1, z_1,
\dots, u_t, z_t` be a sequence of measurements :math:`z_t` and actions
:math:`u_t` up to time :math:`t`. The belief expresses the probability of
being in the current state :math:`x_t`:

.. math::

   \text{Bel}(x_t) = P(x_t \, | \, u_1, z_1, \dots, u_t, z_t)

.. _intro-dists:

Probability distributions
-------------------------

The term **probability distribution** describes the collection of the
probabilities for all possible outcomes. When talking about a probability
distribution, usually the probability mass or density function is referred to.

Remember the **probability mass function** (pmf) :math:`P(X=x)`, already
defined above. It expresses how probable a certain outcome is, and it holds
that

.. math::

   \sum\limits_{x \in \mathcal{X}} P(X=x) = 1

In the discrete case, the probability of a set of outcomes is also easily
computed by

.. math::

   P(X \in A) = \sum\limits_{x \in A} P(X=x)

For discrete :math:`P(X)`, the probability of any element is non-negative and
strictly positive for at least one :math:`x \in \mathcal{X}`. On the other
hand, if :math:`P(X)` is continuous, the probability of one exact value goes
to zero, hence :math:`P(X=x) = 0 \, \forall x`. In consequence, for continuous
random variables we define

.. math::

   P(a \leq X \leq b) = \int\limits_a^b p(x) dx

with :math:`p(x)` the **probability density function** (pdf). Analogous to the
pmf, the probability density function describes how likely a certain outcome
is for the random variable. The pdf is also non-negative, and all
probabilities sum up to one:
:math:`\int\limits_{-\infty}^{\infty} p(x) dx = 1`.

Further, we can define the **cumulative distribution function** (cdf):

.. math::

   F(x) := P(X \leq x) = \int\limits_{-\infty}^x f(t) dt

which expresses the probability that the outcome of the experiment will be no
larger than :math:`x`.
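As a sketch of the relation between pdf and cdf, the cdf can be approximated
by numerically integrating the pdf. The standard normal distribution is used
here as an example, because its cdf has a closed form via the error function
that serves as a check; the integration bounds and step count are arbitrary
choices:

```python
# Approximate the cdf of a standard normal distribution by integrating
# its pdf with the trapezoid rule, and compare against the closed form
# F(x) = (1 + erf(x / sqrt(2))) / 2.
from math import erf, exp, pi, sqrt

def pdf(x):
    """Standard normal probability density function."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def cdf_numeric(x, lower=-10.0, steps=10000):
    """F(x) = integral of pdf from -inf to x, truncated at `lower`."""
    h = (x - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(x))
    for i in range(1, steps):
        total += pdf(lower + i * h)
    return total * h

def cdf_exact(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(round(cdf_numeric(0.0), 4))                   # 0.5
# P(a <= X <= b) = F(b) - F(a)
print(round(cdf_exact(1.0) - cdf_exact(-1.0), 4))   # 0.6827
```

The last line reproduces the well-known fact that about 68% of the probability
mass of a normal distribution lies within one standard deviation of the mean.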
Note the limits :math:`\lim\limits_{x \rightarrow -\infty} F(x) = 0` and
:math:`\lim\limits_{x \rightarrow \infty} F(x) = 1`. From this definition, it
is easily seen that

.. math::

   F(b) - F(a) = P(a \leq X \leq b) = \int\limits_a^b f(x) dx

If :math:`P(X)` is not continuous at some place :math:`x`, the probability at
the discontinuity is

.. math::

   P(X=x) = F(x) - \lim\limits_{y \rightarrow x^-} F(y)

which of course again goes to zero if :math:`P(X)` is continuous at :math:`x`.

For sampling, the **inverse cdf** is important. Unfortunately, a closed form
representation is often not available. The inverse cdf is defined as

.. math::

   F^{-1}(y) := \inf \{ x : F(x) \geq y \} \qquad y \in [0,1]

In other words, for a given :math:`y`, find the smallest value :math:`x` such
that :math:`F(x) = y`.

There are several well known distributions defined in the section
:ref:`core-dists` and also in the `scipy library `_.

.. _intro-stats:

Statistics
----------

Consider a set of data points. Likely, one wants to give some quantitative
description of their shape. Also, one would like to be able to give some
characterisations of known and well defined probability distributions. For
this, let's first define the **Moment**

.. math::

   \mu_k := E[ X^k ]

.. _intro-ev:

using the definition of the **Expected value** (for discrete and continuous
random variables, respectively)

.. math::

   E[X] = \sum\limits_{x} x P(x)

   E[X] = \int\limits_{x} x f(x) dx

Note that :math:`f(x)` is the probability density function of :math:`X`. The
expected value is sometimes referred to as the mean and misleadingly
associated with the arithmetic mean. Of course, this is only true if the
probability of the observations is uniform (all observations are equally
likely), which is often the case for measured data (the probability
distribution is then expressed by the repetition of measurements).

The expected value is linear, a fact that is formally noted by the following
equation:

.. math::

   E[aX + bY + c] = a \, E[X] + b \, E[Y] + c

If the two random variables :math:`X,Y` are independent, it also holds that

.. math::

   E[X \, Y] = E[X] \, E[Y]

The **Central Moment** is the moment about the expected value, thus

.. math::

   \mu_k := E[ (X-E[X])^k ]

From its definition, one can easily see that the zeroth central moment is one
and the first central moment is zero.

.. math::

   \mu_0 = 1

   \mu_1 = 0

.. _intro-var:

The second central moment is the **Variance**

.. math::
   :nowrap:

   \begin{eqnarray*}
   \text{Var}[X] &:=& E[ (X-E[X])^2 ] \\
   &=& \sum\limits_{x} (x-\mu)^2 P(x) \\
   &=& \int\limits_{x} (x-\mu)^2 f(x) dx
   \end{eqnarray*}

where again :math:`f(x)` is the probability density function and :math:`\mu =
E[X]` the expected value. The equations show the definition for a discrete and
a continuous random variable, respectively. An alternative and often more
convenient formulation is

.. math::

   \text{Var}[X] = E[X^2] - (E[X])^2

The variance is pseudo-linear, which means the following:

.. math::

   \text{Var}[aX+b] = a^2 \, \text{Var}[X]

.. _intro-covar:

A generalisation of the variance to two random variables yields the
**Covariance**, which is defined analogously:

.. math::

   \text{Cov}[X,Y] = E[ \left( X-E[X] \right) \left( Y-E[Y] \right) ]

For the sake of completeness but without further comment, here are some
properties of the covariance:

.. math::
   :nowrap:

   \begin{eqnarray*}
   \text{Cov}[X,X] &=& \text{Var}[X] \\
   \text{Cov}[X,Y] &=& E[X \, Y] - E[X] \, E[Y] \\
   \text{Cov}[X,Y] &=& \text{Cov}[Y,X] \\
   \text{Cov}[aX + b, Y] &=& a \, \text{Cov}[X,Y] \\
   \text{Cov}[X+Y,Z] &=& \text{Cov}[X,Z] + \text{Cov}[Y,Z]
   \end{eqnarray*}

.. todo: Information theory basics
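The definitions above can be evaluated directly for discrete random variables.
A minimal sketch (the fair die and the two-coin joint distribution are made-up
examples):

```python
# Expected value, variance and covariance for discrete random variables,
# computed directly from the definitions in this section.
def expected(values, probs):
    """E[X] = sum_x x * P(x)"""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var[X] = E[X^2] - (E[X])^2"""
    ex = expected(values, probs)
    ex2 = expected([x * x for x in values], probs)
    return ex2 - ex * ex

def covariance(joint):
    """Cov[X,Y] = E[XY] - E[X] E[Y], where joint maps (x, y) to P(X=x, Y=y)."""
    exy = sum(x * y * p for (x, y), p in joint.items())
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    return exy - ex * ey

# Fair six-sided die: E[X] = 3.5, Var[X] = 35/12
die = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6] * 6
print(round(expected(die, probs), 10))  # 3.5
print(round(variance(die, probs), 4))   # 2.9167

# Two independent fair coins mapped to {0, 1}: covariance vanishes
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(round(covariance(indep), 10))     # 0.0
```

The covariance of the independent pair being zero illustrates the property
:math:`\text{Cov}[X,Y] = E[X \, Y] - E[X] \, E[Y]` together with
:math:`E[X \, Y] = E[X] \, E[Y]` for independent variables.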