The KMeans algorithm is used for clustering unlabeled (unsupervised) data.
Classically, the algorithm is sketched as follows:
1. Initialize the $K$ cluster centers $m_k$, e.g. with randomly chosen samples.
2. Iterate until convergence:
   1. E-step: assign each sample $x_i$ to its closest cluster center.
   2. M-step: recompute each cluster center as the mean of the samples assigned to it.
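The classical iteration can be sketched in Python. This is a minimal illustration, assuming Euclidean distances and initialization from random samples; it is not the implementation provided by this library:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Classical KMeans sketch: alternate hard E- and M-steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers from K distinct random samples.
    m = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its closest center.
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        c = d.argmin(axis=1)
        # M-step: recompute each center as the mean of its assigned points.
        for k in range(K):
            if (c == k).any():
                m[k] = X[c == k].mean(axis=0)
    return m, c
```

On two well-separated blobs, the returned labels separate the blobs and the centers converge to the blob means.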
As hinted at by the above description, the KMeans algorithm is an instance of an Expectation-Maximization (EM) algorithm. The problem can be formalized as

$$\min_{c,\,\{m_k\}} R(c, \{m_k\}) = \sum_{i=1}^{N} \lVert x_i - m_{c(i)} \rVert^2$$

where $x_i$ are the data points, $m_k$ the cluster centers and $c$ an assignment function that stores the cluster index for each sample. Setting the derivative w.r.t. the cluster centers $m_k$ to zero yields the recomputation (M) step

$$m_k = \frac{1}{|\{i : c(i) = k\}|} \sum_{i : c(i) = k} x_i$$

This is the stationary condition, so the assignments are held fixed. In the E-step, the opposite is the case: the assignments are optimized while the center positions are kept constant. This yields the optimization problem

$$c(i) = \arg\min_k \lVert x_i - m_k \rVert^2$$
Obviously, the solution is to assign each data point to its closest cluster center; this is exactly the E-step of the algorithm.
In this library, instead of the hard assignments used in the algorithm above, the data can be assigned to the clusters probabilistically. The algorithm then becomes:
Initialize the assignment probability matrix $P \in \mathbb{R}^{N \times K}$, e.g. uniformly: $p_{ik} = 1/K$.
Iterate until convergence:
Compute the assignment probabilities (E-step):
$$p_{ik} = \frac{\exp\left(-\lVert x_i - m_k \rVert^2 / T\right)}{\sum_{j=1}^{K} \exp\left(-\lVert x_i - m_j \rVert^2 / T\right)}$$
Recompute the cluster centers (M-step):
$$m_k = \frac{\sum_{i=1}^{N} p_{ik}\, x_i}{\sum_{i=1}^{N} p_{ik}}$$
For a new data point $x$, compute the assignment probabilities $p_k$ as in the E-step and sample the cluster label from the discrete distribution formed by $(p_1, \dots, p_K)$.
The matrix $P$ stores, for each sample, the assignment probabilities to all clusters. Note that $\sum_{k=1}^{K} p_{ik} = 1$ for every sample $i$. This representation comes from a maximum entropy approach.
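The probabilistic E- and M-steps above can be sketched as follows. This is a minimal illustration, not the library's implementation; the `init` argument and the default random initialization are assumptions of this sketch:

```python
import numpy as np

def soft_kmeans(X, K, T=1.0, n_iter=100, init=None, seed=0):
    """Probabilistic KMeans sketch: soft assignments via a Gibbs/softmax rule."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize centers from K distinct random samples.
        m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    else:
        m = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # E-step: p_ik proportional to exp(-||x_i - m_k||^2 / T).
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        logits = -d / T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        # M-step: centers are probability-weighted means of the data.
        m = (P.T @ X) / P.sum(axis=0)[:, None]
    return m, P
```

Each row of the returned matrix sums to one; at a small temperature the rows approach hard (one-hot) assignments.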
The risk is defined as in the 'normal' algorithm. The E-step assumes fixed cluster centers, so the risk can be expressed as a function of the assignments alone:

$$R(c) = \sum_{i=1}^{N} \lVert x_i - m_{c(i)} \rVert^2$$

and inserted into the Gibbs distribution (which maximizes the entropy) with temperature parameter $T$:

$$P(c) = \frac{\exp\left(-R(c)/T\right)}{\sum_{c' \in \mathcal{C}} \exp\left(-R(c')/T\right)}$$
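Because the risk is a sum of independent per-sample terms, the Gibbs distribution over assignments factorizes over the samples; the following sketch (using the symbols above) shows how the per-sample assignment probabilities emerge:

$$P(c) = \prod_{i=1}^{N} \frac{\exp\left(-\lVert x_i - m_{c(i)} \rVert^2 / T\right)}{\sum_{k=1}^{K} \exp\left(-\lVert x_i - m_k \rVert^2 / T\right)} = \prod_{i=1}^{N} p_{i,c(i)}$$

Each factor is exactly the E-step probability $p_{ik}$ evaluated at $k = c(i)$.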
Note that $\mathcal{C}$ denotes the set of all possible assignments. The equation can be reformulated to give the probability of an assignment $c$ (given the cluster centers $m_k$) and, from there, the assignment probabilities $p_{ik}$ used in the E-step. Further, the entropy maximization for the cluster centers amounts to minimizing the free energy:

$$F(\{m_k\}) = -T \log \sum_{c \in \mathcal{C}} \exp\left(-R(c)/T\right)$$
This equation can again be solved using the stationary condition (setting the derivative w.r.t. $m_k$ to zero), which yields the M-step of the algorithm.
In this representation, the temperature parameter $T$ is introduced. A large temperature leads to less distinct centers and an almost uniform assignment probability of the data to the centers. A small temperature has the inverse effect: the centers become distinct and the assignment probability of each sample tends towards one for one cluster and zero for the others. Loosely speaking, the temperature controls how strongly the clusters are influenced by the data, i.e. how sensitive the algorithm is to the input data.
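The effect of the temperature can be seen numerically. A small sketch with hypothetical numbers, not part of the library:

```python
import numpy as np

def assignment_probs(x, centers, T):
    """Soft E-step for one sample: p_k proportional to exp(-||x - m_k||^2 / T)."""
    d = ((centers - x) ** 2).sum(axis=1)
    logits = -d / T
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

x = np.array([1.0, 0.0])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(assignment_probs(x, centers, 100.0))  # large T: nearly uniform
print(assignment_probs(x, centers, 0.1))    # small T: nearly one-hot
```

With the large temperature both clusters are almost equally probable despite the point being closer to the first center; with the small temperature essentially all probability mass sits on the closest center.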
Bases: ailib.fitting.kNN.kNN
KMeans algorithm.
This is an implementation of the probabilistic KMeans algorithm, not the often-used Lloyd algorithm. The default evaluation method assigns the sample to the closest cluster. Alternatively, the class label can be sampled according to the distances, i.e. the closer a sample is to a cluster, the more probable it is that this cluster's label is returned. If this behaviour is desired, use the Kmeans.Soft mixin.
There are several convergence criteria:
Select one of them by assigning it to obj._converged.
Parameters: 


Sample from clusters according to their distances.
Parameters: 


Returns:  self 
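The soft evaluation described above can be sketched as follows; `soft_label` is a hypothetical helper written for illustration, not the class's actual API:

```python
import numpy as np

def soft_label(x, centers, T=1.0, rng=None):
    """Sample a cluster label with probability proportional to exp(-||x - m_k||^2 / T)."""
    if rng is None:
        rng = np.random.default_rng()
    d = ((centers - x) ** 2).sum(axis=1)
    p = np.exp(-(d - d.min()) / T)  # shift by the minimum for numerical stability
    p /= p.sum()
    return rng.choice(len(centers), p=p)
```

As the temperature tends to zero, this degenerates to the default hard evaluation (always returning the closest center's label).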