Clustering
**********

.. contents::

.. module:: ailib.fitting

K-Means
=======

The K-Means algorithm is used for unsupervised clustering of data. Classically, the
algorithm is sketched as follows:

1. Initialize cluster centers at randomly chosen data points.
2. Iterate until convergence:

   a) Assign each data point to the closest cluster (E-Step).
   b) Recompute the cluster centers as the average of all associated data points (M-Step).

3. A new data point is classified according to the closest cluster center.

As also hinted by the above description, the K-Means algorithm is an instance of an
:term:`Expectation-Maximization` algorithm. The problem can be formalized as

.. math::
   \text{arg}\min\limits_{c,y} \frac{1}{N} \sum\limits_{i=1}^{N} \| x_i - y_{c(i)} \|^2

where :math:`x` are the data points, :math:`y` the cluster centers and :math:`c(i)` an
assignment function that stores the cluster index for each sample.

Setting the derivative w.r.t. the cluster centers to zero (the stationary condition, with
the assignments held fixed) yields the recomputation (M) Step. In the E-Step, the opposite
is the case: the assignments are optimized while the center positions are kept constant.
This yields the optimization problem

.. math::
   c(i) = \text{arg}\min\limits_{j} \| x_i - y_{j} \|^2

Its solution, assigning each data point to the closest cluster center, is exactly the
E-Step of the algorithm.

In this library, instead of hard assignments as in the above algorithm, the data can be
assigned to the clusters in a probabilistic way. The algorithm then becomes:

1. Initialize the assignment probability matrix: :math:`\mathbf{P}_{i,\alpha} = \frac{1}{N}`.
2. Iterate until convergence:

   a) Compute the assignment probabilities (E-Step):

      .. math::
         \mathbf{P}_{i,\alpha} = \frac{ \exp \left( - (x_i - y_{\alpha})^2 / T \right) }{ \sum\limits_{\nu=1}^K \exp \left( - (x_i - y_{\nu})^2 / T \right)}

   b) Recompute the cluster centers (M-Step):

      .. math::
         y_{\alpha} = \frac{ \sum\limits_{i=1}^N \mathbf{P}_{i,\alpha} x_i }{ \sum\limits_{i=1}^N \mathbf{P}_{i,\alpha} }

3. For a new data point :math:`x`, compute the assignment probabilities

   .. math::
      \mathbf{P}(\alpha) = \frac{ \exp \left( - (x - y_{\alpha})^2 / T \right) }{ \sum\limits_{\nu=1}^K \exp \left( - (x - y_{\nu})^2 / T \right)}

   and sample from the discrete distribution formed by
   :math:`\{ \mathbf{P}(\alpha) \, | \, \alpha \in [1,K] \}`.

The matrix :math:`\mathbf{P}` stores the assignment probabilities to all clusters for each
sample. Note that :math:`\sum_{\alpha} \mathbf{P}_{i,\alpha} = 1`.

This representation comes from a maximum entropy approach. The risk is defined as in the
'normal' algorithm. The E-Step assumes fixed cluster centers, so the risk can be expressed
as a function of the assignment alone:

.. math::
   R(c) = \sum\limits_i \left( x_i - y_{c(i)} \right)^2

It is inserted into the Gibbs distribution (which maximizes the entropy) with parameter
:math:`T`:

.. math::
   \mathbf{P}(c;y_{\nu}) = \frac{ \exp( -R(c)/T ) }{ \sum\limits_{c' \in \mathcal{C}} \exp( -R(c')/T ) }

Note that :math:`\mathcal{C}` denotes the set of all possible assignments. The equation can
be reformulated to give the probability of an assignment :math:`c` (given the cluster
centers :math:`y_{\nu}`), and from there the assignment probabilities
:math:`\mathbf{P}_{i,\alpha}` used in the E-Step are found.
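For illustration, the E-Step probabilities are a temperature-scaled softmax over negative
squared distances. The following sketch (plain NumPy; the function name and array shapes
are illustrative only and not part of the :class:`Kmeans` interface) shows one way to
evaluate :math:`\mathbf{P}_{i,\alpha}`:

.. code-block:: python

   import numpy as np

   def soft_assignments(X, Y, T):
       """E-Step: P[i, a] ~ exp(-||x_i - y_a||^2 / T), normalized over the clusters a."""
       # Squared Euclidean distances between all samples and centers, shape (N, K).
       d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
       # Shift by the row-wise minimum for numerical stability; the ratio is unchanged.
       P = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
       return P / P.sum(axis=1, keepdims=True)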
Further, the entropy maximization for the cluster centers is

.. math::
   y_\alpha = \text{arg}\max\limits_{y_\nu} - \sum\limits_{c \in \mathcal{C}} \mathbf{P}(c;y_{\nu}) \log \mathbf{P}(c;y_{\nu})

This equation can again be solved using the stationary condition (setting the derivative
w.r.t. :math:`y_{\nu}` to zero), which yields the M-Step of the algorithm.

In this representation, the parameter :math:`T` is introduced. A large temperature
:math:`T` leads to less distinct centers and a nearly uniform assignment probability of
the data to the centers. A small temperature has the inverse effect: the centers become
distinct and the assignment probability of each sample tends towards one for a single
cluster and zero for the others. Loosely speaking, the temperature controls how strongly
the clusters are influenced by the data, i.e. how sensitive the algorithm is to the input
data.

Interfaces
==========

.. autoclass:: Kmeans
   :members:
   :show-inheritance:

Examples
========
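As a self-contained illustration of the procedure described above (it does not use the
:class:`Kmeans` interface; all names are illustrative and the data is synthetic), the
probabilistic E/M iteration can be sketched in a few lines of NumPy:

.. code-block:: python

   import numpy as np

   def soft_kmeans(X, K, T=1.0, iterations=100, seed=0):
       """Probabilistic K-Means: alternate the E-Step and M-Step described above."""
       rng = np.random.default_rng(seed)
       # Initialize the centers at randomly chosen data points.
       Y = X[rng.choice(len(X), size=K, replace=False)]
       for _ in range(iterations):
           # E-Step: soft assignment probabilities, shape (N, K).
           d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
           P = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
           P /= P.sum(axis=1, keepdims=True)
           # M-Step: centers as probability-weighted means of the data.
           Y = (P.T @ X) / P.sum(axis=0)[:, None]
       return Y, P

   # Two well-separated blobs in two dimensions.
   rng = np.random.default_rng(1)
   X = np.vstack([rng.normal(-2.0, 0.3, size=(50, 2)),
                  rng.normal(+2.0, 0.3, size=(50, 2))])
   centers, P = soft_kmeans(X, K=2, T=0.5)
   print(centers)        # roughly (-2, -2) and (+2, +2)
   print(P.sum(axis=1))  # each row of the assignment matrix sums to 1

Lowering :math:`T` makes the rows of :math:`\mathbf{P}` approach hard (0/1) assignments,
while raising it drives them towards the uniform distribution, as discussed above.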