.. _intro:

Introduction
************

.. contents::

.. _intro-about:

About this library
==================

This library provides several algorithms from the field of machine learning
and statistics. A short introduction to statistics is given in the next
chapter. The goal of this documentation is to provide a theoretical background
in addition to the implementation. The main topics covered so far are sampling
methods, model fitting and model selection. The model fitting module provides
many different learning algorithms. Where possible, we tried to stick to a
statistics and information theory framework.

Besides the `python standard library `_, the framework mainly uses `scipy `_
and `numpy `_. Although not all algorithms rely on them, using those libraries
is strongly encouraged.

Other projects that might be useful for what you are trying to achieve can be
found at `scikit `_. Especially noteworthy is the package `scikit-learn `_,
which covers a lot of machine learning algorithms. For statistical
computation, there is a nice list of projects at `StatPy `_. Also noteworthy
are the `SciTools `_ package, `MatPlotLib `_ for plotting and the convex
optimization package `cvxopt `_.

If you are looking for more theoretical information, see the listed
:ref:`resources`. On the web, many topics are well covered by `wikipedia `_,
and the `StatSoft Statistics textbook `_ might also be useful.

.. _intro-type:

Type notations
--------------

It is important to understand the type signatures; they help a lot in
understanding the structure of the code. Wherever a common type like float or
int (or a type that mimics it) is expected, this is noted using the python
syntax. Where a list is expected, the respective python symbol is used as
well. For example, [[float]] denotes a matrix of floats, whereas (float) is a
tuple of floats. Lists and tuples are usually exchangeable, but it is strongly
recommended to use the specified one.

The syntax for functions and unspecified types is haskell-like.
If there are no restrictions on a type, a single character is placed instead.
Several types are concatenated using an arrow (->) and the last type in the
chain is the return value. Consider the example below:

>>> (Num b) :: [a] -> b -> a

This signature defines a function that takes a list of an arbitrary type and a
second parameter of (possibly) another type, and returns a value of the same
type as the elements of the first parameter's list. Additionally, if a type is
required to behave like a numerical type (int or float), this is indicated as
above for the type b. Several restrictions are separated by commas. Besides
the numerical restriction, Ord (ordered type) may also appear. A function with
this signature could be as simple as

>>> lambda lst, idx: lst[idx]

Usually, the type symbols are coherent within all methods of a class and often
also with related classes (i.e. classes in the same document). If no type
signature is given, the type is completely arbitrary. However, note that this
is usually the case for abstract methods, and type restrictions may be found
in their concrete implementations.

.. _intro-statistics:

Statistics formulary
====================

.. _intro-combinatorics:

Combinatorics
-------------

Draws of objects from an urn. There are n objects and k draws.

+---------------------+-----------+----------------------+
|                     | Ordered   | Unordered            |
+---------------------+-----------+----------------------+
| With replacement    | n^k       | (n+k-1)!/((n-1)!*k!) |
+---------------------+-----------+----------------------+
| Without replacement | n!/(n-k)! | n!/((n-k)!*k!)       |
+---------------------+-----------+----------------------+

.. _intro-rv:

Random variables and probability mass function
----------------------------------------------

Consider an experiment with several possible outcomes where the outcome
depends on some underlying random process. A prominent example is a coin toss
where the possible results are head and tail.
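The four counts in the table above can be computed directly with the python
standard library. A minimal sketch (the function names are for illustration
only, they are not part of this library):

```python
# The four urn-draw counts for n objects and k draws, matching the table.
from math import comb, perm

def ordered_with_replacement(n, k):
    return n ** k                      # n^k

def unordered_with_replacement(n, k):
    return comb(n + k - 1, k)          # (n+k-1)! / ((n-1)! * k!)

def ordered_without_replacement(n, k):
    return perm(n, k)                  # n! / (n-k)!

def unordered_without_replacement(n, k):
    return comb(n, k)                  # n! / ((n-k)! * k!)

# Example: n = 4 objects, k = 2 draws
print(ordered_with_replacement(4, 2))      # 16
print(unordered_with_replacement(4, 2))    # 10
print(ordered_without_replacement(4, 2))   # 12
print(unordered_without_replacement(4, 2)) # 6
```

Note that ``math.comb`` and ``math.perm`` require python 3.8 or newer.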
The set of all the experiment's outcomes is the **sample space**
:math:`\mathcal{S}`. Depending on the nature of the experiment, it may be
either discrete or continuous.

Such an experiment can be expressed through a **random variable** :math:`X`. A
random variable can be viewed as a function that assigns a value (often a real
number) to each element of the sample space :math:`\mathcal{S}`.

.. math::

   X: \mathcal{S} \rightarrow \mathcal{A}

If the sample space :math:`\mathcal{S}` is continuous, then so is the random
variable :math:`X`. If :math:`\mathcal{S}` is countable, then :math:`X` is
discrete. Note that the space of a variable is always written with a
calligraphic letter (e.g. :math:`\mathcal{S}, \mathcal{X}, \mathcal{Y}`).

The **probability mass function** :math:`P: \mathcal{A} \rightarrow [0,1]`
gives the probability of seeing a specific outcome of a random variable. If
the random variable is continuous, this probability has to be zero, so
:math:`P(X=x) = 0`. The probability of seeing any possible outcome must be
one; hence, for the discrete and the continuous case respectively,

.. math::

   \sum\limits_{x \in \mathcal{S}} P(X=x) = 1

   \int\limits_{x \in \mathcal{S}} P(X=x) \, dx = 1

Usually, a random variable is denoted with an uppercase letter (:math:`X`)
whereas a lowercase letter (:math:`x`) is used for a specific value of the
variable. Also, in order to make mathematical terms more readable, the
following notation is used:

.. math::

   P(x) := P(X=x)

Given two random variables :math:`X,Y`, the probability that the respective
outcomes :math:`x` and :math:`y` are observed at the same time is called the
**joint probability** :math:`P(x,y) := P(X=x,Y=y)`.

.. _intro-independence:

Two random variables are **independent** if and only if

.. math::

   P(x,y) = P(x) P(y)

holds. Independence means that the outcome of one random variable does not
influence the outcome of the other one.
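As a sketch, independence of two discrete random variables can be checked by
comparing each joint probability with the product of the marginals. The coin
distribution below is a made-up example:

```python
# Check P(x, y) == P(x) * P(y) for all outcomes of a discrete joint
# distribution. The joint table models two independent fair coins.
from itertools import product

joint = {  # P(X=x, Y=y)
    ('H', 'H'): 0.25, ('H', 'T'): 0.25,
    ('T', 'H'): 0.25, ('T', 'T'): 0.25,
}

def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    p = {}
    for outcome, prob in joint.items():
        key = outcome[axis]
        p[key] = p.get(key, 0.0) + prob
    return p

def independent(joint, tol=1e-12):
    """True iff P(x, y) == P(x) * P(y) holds for every pair (x, y)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(px, py))

print(independent(joint))  # True
```

A fully dependent distribution such as ``{('H', 'H'): 0.5, ('T', 'T'): 0.5}``
would return ``False``.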
Given the joint distribution of two random variables, the probability of one
of them alone is easily found by summing or integrating over the other one,
i.e. :math:`P(x) = \sum\limits_y P(x,y)` if the variables are discrete.

The **conditional probability** :math:`P(x|y)` is the probability of seeing a
certain outcome :math:`x`, given that the outcome of the other random variable
was already observed to be :math:`y`.

.. math::

   P(x|y) = P(X=x \, | \, Y=y) := \frac{P(x,y)}{P(y)}

For independent variables, it can be seen that :math:`P(x|y) = P(x)`, which
intuitively makes sense as :math:`y` does not influence :math:`x`.

The law of **total probability**, also called **marginalization**, states the
following (for the discrete and the continuous case, respectively):

.. math::

   P(x) = \sum\limits_{y \in \mathcal{Y}} P(x|y) P(y)

   P(x) = \int\limits_{y \in \mathcal{Y}} P(x|y) P(y) \, dy

.. _intro-bayes:

Bayes
-----

The **Bayes theorem** states the following:

.. math::

   P(x|y) = \frac{ P(y|x) P(x) }{ P(y) }

The term :math:`P(x)` is the prior probability: as it is independent of
:math:`y`, it can be interpreted as the probability of an outcome of :math:`X`
when no information about the second variable :math:`Y` is available. In this
view, the conditional probability :math:`P(x|y)` is the posterior, i.e. the
probability after some information has been observed. The denominator
:math:`P(y)` serves for normalization and is hence also called the normalizer.
Using the law of total probability, the theorem can be rewritten into the
following form:

.. math::

   P(x|y) = \frac{ P(y|x) P(x) }{ \sum\limits_{x' \in \mathcal{X}} P(y|x') P(x') }

If there are three random variables, the probability of the outcome :math:`x`,
given that the two outcomes :math:`y,z` are observed, is

.. math::

   P(x|y,z) = \frac{ P(y|x,z) P(x|z) }{ P(y|z) }

Two variables :math:`X,Y` are **conditionally independent** given a third
random variable :math:`Z`, iff

.. math::

   P(x,y|z) = P(x|z) P(y|z)

It follows that

.. math::

   P(x|y,z) = P(x|z) \qquad P(y|x,z) = P(y|z)

In robotics, the term **Belief** is sometimes used. Let :math:`u_1, z_1,
\dots, u_t, z_t` be a sequence of measurements :math:`z_t` and actions
:math:`u_t` up to time :math:`t`. The belief expresses the probability of
being in the current state :math:`x_t`:

.. math::

   \text{Bel}(x_t) = P(x_t \, | \, u_1, z_1, \dots, u_t, z_t)

.. _intro-dists:

Probability distributions
-------------------------

The term **probability distribution** describes the collection of the
probabilities for all possible outcomes. When talking about a probability
distribution, usually the probability mass or density function is referred to.

Remember the **probability mass function** (pmf) :math:`P(X=x)`, already
defined above. It expresses how probable a certain outcome is, and it holds
that

.. math::

   \sum\limits_{x \in \mathcal{X}} P(X=x) = 1

In the discrete case, the probability of a set of outcomes is also easily
computed by

.. math::

   P(X \in A) = \sum\limits_{x \in A} P(X=x)

For discrete :math:`P(X)`, the probability of any element is non-negative and
strictly positive for at least one :math:`x \in \mathcal{X}`. On the other
hand, if :math:`P(X)` is continuous, the probability of one exact value goes
to zero, hence :math:`P(X=x) = 0 \, \forall x`. In consequence, for continuous
random variables we define

.. math::

   P(a \leq X \leq b) = \int\limits_a^b p(x) dx

with :math:`p(x)` the **probability density function** (pdf). Analogous to the
pmf, the probability density function describes how likely a certain outcome
is for the random variable. The pdf is also non-negative, and all
probabilities sum up to one:
:math:`\int\limits_{-\infty}^{\infty} p(x) dx = 1`.

Further, we can define the **cumulative distribution function** (cdf):

.. math::

   F(x) := P(X \leq x) = \int\limits_{-\infty}^x f(t) dt

which expresses the probability that the outcome of the experiment will be no
larger than :math:`x`.
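As a sketch of the relation between pdf and cdf, the cdf can be approximated
by numerically integrating the pdf. The standard normal distribution is used
here as an example, because its cdf has a closed form via the error function
that serves as a check; the integration bounds and step count are arbitrary
choices:

```python
# Approximate the cdf of a standard normal distribution by integrating
# its pdf with the trapezoid rule, and compare against the closed form
# F(x) = (1 + erf(x / sqrt(2))) / 2.
from math import erf, exp, pi, sqrt

def pdf(x):
    """Standard normal probability density function."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def cdf_numeric(x, lower=-10.0, steps=10000):
    """F(x) = integral of pdf from -inf to x, truncated at `lower`."""
    h = (x - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(x))
    for i in range(1, steps):
        total += pdf(lower + i * h)
    return total * h

def cdf_exact(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(round(cdf_numeric(0.0), 4))                   # 0.5
# P(a <= X <= b) = F(b) - F(a)
print(round(cdf_exact(1.0) - cdf_exact(-1.0), 4))   # 0.6827
```

The last line reproduces the well-known fact that about 68% of the probability
mass of a normal distribution lies within one standard deviation of the mean.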
Note the limits :math:`\lim\limits_{x \rightarrow -\infty} F(x) = 0` and
:math:`\lim\limits_{x \rightarrow \infty} F(x) = 1`. From this definition, it
is easily seen that

.. math::

   F(b) - F(a) = P(a \leq X \leq b) = \int\limits_a^b f(x) dx

If :math:`P(X)` is not continuous at some place :math:`x`, the probability at
the discontinuity is

.. math::

   P(X=x) = F(x) - \lim\limits_{y \rightarrow x^-} F(y)

which of course again goes to zero if :math:`P(X)` is continuous at :math:`x`.

For sampling, the **inverse cdf** is important. Unfortunately, a closed form
representation is often not available. The inverse cdf is defined as

.. math::

   F^{-1}(y) := \inf \{ x : F(x) \geq y \} \qquad y \in [0,1]

In other words, for a given :math:`y`, find the smallest value :math:`x` such
that :math:`F(x) = y`.

There are several well known distributions defined in the section
:ref:`core-dists` and also in the `scipy library `_.

.. _intro-stats:

Statistics
----------

Consider a set of data points. Likely, one wants to give some quantitative
description of their shape. Also, one would like to be able to give some
characterisations of known and well defined probability distributions. For
this, let's first define the **Moment**

.. math::

   \mu_k := E[ X^k ]

.. _intro-ev:

using the definition of the **Expected value** (for discrete and continuous
random variables, respectively)

.. math::

   E[X] = \sum\limits_{x} x P(x)

   E[X] = \int\limits_{x} x f(x) dx

Note that :math:`f(x)` is the probability density function of :math:`X`. The
expected value is sometimes referred to as the mean and misleadingly
associated with the arithmetic mean. Of course, this is only true if the
probability of the observations is uniform (all observations are equally
likely), which is often the case for measured data (the probability
distribution is then expressed by the repetition of measurements).

The expected value is linear, a fact that is formally noted by the following
equation:

.. math::

   E[aX + bY + c] = a \, E[X] + b \, E[Y] + c

If the two random variables :math:`X,Y` are independent, it also holds that

.. math::

   E[X \, Y] = E[X] \, E[Y]

The **Central Moment** is the moment about the expected value, thus

.. math::

   \mu_k := E[ (X-E[X])^k ]

From its definition, one can easily see that the zeroth central moment is one
and the first central moment is zero.

.. math::

   \mu_0 = 1

   \mu_1 = 0

.. _intro-var:

The second central moment is the **Variance**

.. math::
   :nowrap:

   \begin{eqnarray*}
   \text{Var}[X] &:=& E[ (X-E[X])^2 ] \\
   &=& \sum\limits_{x} (x-\mu)^2 P(x) \\
   &=& \int\limits_{x} (x-\mu)^2 f(x) dx
   \end{eqnarray*}

where again :math:`f(x)` is the probability density function and :math:`\mu =
E[X]` the expected value. The equations show the definition for a discrete and
a continuous random variable, respectively. An alternative and often more
convenient formulation is

.. math::

   \text{Var}[X] = E[X^2] - (E[X])^2

The variance is pseudo-linear, which means the following:

.. math::

   \text{Var}[aX+b] = a^2 \, \text{Var}[X]

.. _intro-covar:

A generalisation of the variance to two random variables yields the
**Covariance**, which is defined analogously:

.. math::

   \text{Cov}[X,Y] = E[ \left( X-E[X] \right) \left( Y-E[Y] \right) ]

For the sake of completeness but without further comment, here are some
properties of the covariance:

.. math::
   :nowrap:

   \begin{eqnarray*}
   \text{Cov}[X,X] &=& \text{Var}[X] \\
   \text{Cov}[X,Y] &=& E[X \, Y] - E[X] \, E[Y] \\
   \text{Cov}[X,Y] &=& \text{Cov}[Y,X] \\
   \text{Cov}[aX + b, Y] &=& a \, \text{Cov}[X,Y] \\
   \text{Cov}[X+Y,Z] &=& \text{Cov}[X,Z] + \text{Cov}[Y,Z]
   \end{eqnarray*}

.. todo: Information theory basics
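The definitions above can be evaluated directly for discrete random variables.
A minimal sketch (the fair die and the two-coin joint distribution are made-up
examples):

```python
# Expected value, variance and covariance for discrete random variables,
# computed directly from the definitions in this section.
def expected(values, probs):
    """E[X] = sum_x x * P(x)"""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var[X] = E[X^2] - (E[X])^2"""
    ex = expected(values, probs)
    ex2 = expected([x * x for x in values], probs)
    return ex2 - ex * ex

def covariance(joint):
    """Cov[X,Y] = E[XY] - E[X] E[Y], where joint maps (x, y) to P(X=x, Y=y)."""
    exy = sum(x * y * p for (x, y), p in joint.items())
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    return exy - ex * ey

# Fair six-sided die: E[X] = 3.5, Var[X] = 35/12
die = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6] * 6
print(round(expected(die, probs), 10))  # 3.5
print(round(variance(die, probs), 4))   # 2.9167

# Two independent fair coins mapped to {0, 1}: covariance vanishes
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(round(covariance(indep), 10))     # 0.0
```

The covariance of the independent pair being zero illustrates the property
:math:`\text{Cov}[X,Y] = E[X \, Y] - E[X] \, E[Y]` together with
:math:`E[X \, Y] = E[X] \, E[Y]` for independent variables.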