4.3.7. Clustering K-Means

The K-Means algorithm is used for clustering of unsupervised data.

Classically, the algorithm is sketches as follows:

  1. Initialize cluster centers at randomly chosen data points.
  2. Iterate until convergence
    1. Assign each data point to the closest cluster (E-Step).
    2. Recompute the cluster centers as the average of all associated data points (M-Step).
  3. A new data point is classified according to the closest cluster center.

As also hinted by the above description, the K-Means algorithm is an instance of an Expectation-Maximization algorithm. The problem can be formalized as

\text{arg}\min\limits_{c,y} \frac{1}{N} \sum\limits_{i}^{N} \| x_i - y_{c(i)} \|^2

where x are the data points, y the cluster centers and c(i) an assignment function that stores the cluster index for each sample. Setting the derivative w.r.t. the cluster centers to zero yields the recomputation (M) step. This is the stationary condition, so assignments are fixed. In the E-Step, the opposite is the case: the assignments are optimized while the center positions are constant. This yields the optimization problem

c(i) = \text{arg}\min\limits_{j} \| x_i - y_{j} \|^2

Obviously, the solution is to assign a data point to the closest cluster is the E-Step from the algorithm.

In this library, instead of hard assignments like in the above algorithm, the data can be assigned to the clusters in a probabilistic way. The algorithm then becomes:

  1. Initialize assignment probability matrix: \mathbf{P}_{i,\alpha} = \frac{1}{N}.

  2. Iterate until convergence

    1. Compute the assignment probabilities (E-Step):

      \mathbf{P}_{i,\alpha} = \frac{ \exp \left( - (x_i - y_{\alpha})^2 / T \right) }{ \sum\limits_{\nu=1}^K \exp \left( - (x_i - y_{\nu})^2 / T \right)}

    2. Recompute the cluster centers (M-Step):

      y_{\alpha} = \frac{ \sum\limits_{i=1}^N \mathbf{P}_{i,\alpha} x_i }{ \sum\limits_{i=1}^N \mathbf{P}_{i,\alpha} }

  3. For a new data point x, compute the assignment probabilities

    \mathbf{P}(\alpha) = \frac{ \exp \left( - (x - y_{\alpha})^2 / T \right) }{ \sum\limits_{\nu=1}^K \exp \left( - (x - y_{\nu})^2 / T \right)}

    and sample from the discrete distribution formed by \{ \mathbf{P}(\alpha) \, | \, \alpha \in [1,K] \}.

The matrix \mathbf{P} stores the assignment probabilities to all clusters for each sample. Note that \sum_{\alpha} \mathbf{P}_{i,\alpha} = 1. This representation comes from a maximum entropy approach.

The risk is defined as in the ‘normal’ algorithm. The E-Step assumes fixed cluster centers, so the risk can be expressed as:

R(\alpha) = \sum\limits_i (x_i - y_\alpha)^2

and inserted into the gibbs distribution (which maximizes the entropy) with parameter T.

\mathbf{P}(c;y_{\nu}) = \frac{ \exp( -R(c)/T ) }{ \sum\limits_{c' \in \mathcal{C}} \exp( -R(c')/T ) }

Note that \mathcal{C} denotes the set of all possible assignments. The equation can be reformulated to give the probability of an assignment c (given the cluster centers y_{\nu}) and from there, the assignment probabilities \mathbf{P}_{i,\alpha} used in the E-Step are found. Further, the entropy maximization for the cluster center is:

y_\alpha = \text{arg}\max_{y_\nu} - \sum_{c \in \mathcal{C}} \mathbf{P}(c;y_{\nu}) \log \mathbf{P}(c;y_{\nu})

This equation can be again solved using the stationary condition (setting the second derivative w.r.t. y_{\nu} to zero) and gets the M-Step of the algorithm.

In this representation, the parameter T is introduced. A large temperature T leads to fewer diverse centers and uniform assignment probability of the data to the centers. A small temperature results in the inverse effect: The centers become distinct and the assignment probability of each sample tends towards one for one cluster and zero for the others. Loosely speaking, the temperature controls how strongly the clusters are influenced by the data, i.e how sensitive the algorithm is to the input data. Interfaces

class ailib.fitting.Kmeans(numClusters, tol=1, maxIter=100, temp=1.0, dist=<function <lambda> at 0x36eacf8>)

Bases: ailib.fitting.kNN.kNN

K-Means algorithm.

This is an implementation of the probabilistic K-Means algorithm, not the often used Lloyd algorithm. The default evaluation method is to assign the sample to the closest cluster. Instead, the class label can be sampled from the distances, i.e. the closer to a cluster, the more probable this cluster label will be returned. If this behaviour is desired, use the Kmeans.Soft mixin.

There are several convergence criterions:

  • Risk is below a threshold (absoluteRisk).
  • risk decrease is below a threshold (relativeRisk).
  • Stop after a number of iterations (iterCriterion).
  • Assignments don’t change (centerCriterion).

Select one of those by assignment to obj._converged.

  • numClusters (int) – Number of clusters.
  • tol (float) – Threshold for the convergence criterion.
  • maxIter (int) – Limit to number of iterations.
  • temp (float) – Temperature.
  • dist (a -> a -> float) – Distance measurement
class Soft

Sample from clusters according to their distances.

Return the class of x, sampled from the available clusters.
Kmeans.fit(data, labels=None, weights=None, T=None)
  • data ([a]) – Training data.
  • labels (None) – Unused.
  • weigths – Unused.
  • T – Temperature. If set, overwrites member.


Table Of Contents

Previous topic

4.3.6. Trees and Random Forests

Next topic

5. selection — Model selection and validation

This Page