5. selection — Model selection and validation

5.1. Information Criteria

In maximum likelihood methods, the Bayesian and Akaike information criteria give a measure of model quality that balances goodness of fit against model complexity.

Let L = \max\limits_{\theta} p(\vec{x},\vec{y}|\theta) be the maximized likelihood of the model parameters \theta with respect to the training data \vec{x}, \vec{y}. Furthermore, let the model have k free parameters that are estimated and let there be n training samples. The two information criteria are defined in the following way:

  1. Bayesian information criterion

    \text{BIC} := -2 \ln(L) + k \ln(n)

  2. Akaike information criterion

    \text{AIC} := - 2 \ln(L) + 2k

From this definition, it can be seen that the criteria add a complexity penalty to the parameter likelihood. The reason behind this is the following: it is assumed that the higher the number of parameters, the better any set of training data can be approximated by the model. For example, consider polynomial fitting. The number of free parameters corresponds to the degree of the fitted polynomial. Given a degree one less than the number of training samples, the polynomial can interpolate the data perfectly. But as it will be a curve through all given data pairs, its generalization power will be poor. The goal is to find a model with a reasonable number of parameters but at the same time a low training error (i.e. a high maximum likelihood).

If the errors are i.i.d. normal with estimated variance \hat{\sigma}^2, the criteria can be stated as:

\text{BIC} = n \ln \hat{\sigma}^2 + k \ln(n)
\text{AIC} = n \ln \hat{\sigma}^2 + 2k
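Under the Gaussian-error form above, both criteria are straightforward to compute from the residual variance. The following sketch is illustrative (the function and variable names are not part of the library):

```python
import math

def bic(n, k, sigma2):
    """Bayesian information criterion for n samples, k free parameters,
    and ML residual variance sigma2 (i.i.d. Gaussian errors assumed)."""
    return n * math.log(sigma2) + k * math.log(n)

def aic(n, k, sigma2):
    """Akaike information criterion under the same assumptions."""
    return n * math.log(sigma2) + 2 * k

# Hypothetical comparison: a simple model with a slightly worse fit
# versus a complex model that fits the training data a bit better.
n = 50
score_simple = bic(n, k=2, sigma2=1.2)    # few parameters, larger residuals
score_complex = bic(n, k=10, sigma2=1.0)  # many parameters, smaller residuals
# The complexity penalty dominates: BIC prefers the simpler model here.
```

The same trade-off appears with AIC, only with a weaker penalty (2k instead of k ln n).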

Instead of the plain Akaike information criterion, one should rather employ the corrected AIC:

\text{AICc} := \text{AIC} + \frac{2k(k + 1)}{n - k - 1}

As n gets large, the two criteria become identical. If n is small relative to k, the uncorrected AIC tends to favor models with more parameters and is therefore more prone to overfitting.
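The behavior of the correction term can be checked numerically; the snippet below is an illustrative sketch, not part of the library:

```python
import math

def aic(n, k, sigma2):
    return n * math.log(sigma2) + 2 * k

def aicc(n, k, sigma2):
    """Corrected AIC; only valid for n > k + 1."""
    return aic(n, k, sigma2) + 2 * k * (k + 1) / (n - k - 1)

# For small n the correction is substantial ...
small_gap = aicc(20, 5, 1.0) - aic(20, 5, 1.0)      # 2*5*6 / 14
# ... while for large n it becomes negligible.
large_gap = aicc(10000, 5, 1.0) - aic(10000, 5, 1.0)
```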

5.2. K-Fold Cross-Validation

The cross-validation method tries to estimate the generalization error of a model, given some training data. For K-fold cross-validation, the N samples are split into K subsets of (approximately) equal size. In each round, K-1 subsets are used as the training set and the remaining subset as the test set:

  1. Split the data into K subsets.

  2. For k=1..K

    1. Fit the model using all subsets except the k\text{'th}.
    2. Compute the testing error err_k on the k\text{'th} subset.
  3. The estimated generalization error is the average of the errors of the K testing sets.

    err_{CV} = \frac{1}{K} \sum\limits_{k=1}^K err_k

If K=N, each sample is used exactly once as the test set (consisting of only that sample). This special case is called leave-one-out cross-validation.
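The procedure above can be sketched in a few lines. This is an illustrative standalone helper, not the library's interface; the model is represented by a pair of fit/predict functions and the error is the mean squared error:

```python
def k_fold_cv(xs, ys, fit, predict, K):
    """Estimate the generalization error via K-fold cross-validation."""
    n = len(xs)
    # Partition the sample indices into K interleaved subsets.
    folds = [list(range(i, n, K)) for i in range(K)]
    errs = []
    for fold in folds:
        held_out = set(fold)
        # Fit on all subsets except the current one.
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_y)
        # Mean squared prediction error on the held-out subset.
        err = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
        errs.append(err)
    return sum(errs) / K

# Toy model: always predict the mean of the training targets.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, x: model

# With K = N this is leave-one-out cross-validation.
cv_err = k_fold_cv([0, 1, 2, 3], [1, 2, 3, 4], fit_mean, predict_mean, K=4)
```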

5.3. Random Sample Consensus

Given a set of training data that is noisy and contains outliers, i.e. erroneous data points, the aim of the RANSAC (random sample consensus) algorithm is to determine which data points are outliers and compute the model without them.

Given the minimal number of points M required to fit the model and a threshold, the algorithm does the following:

  1. Iterate until convergence

    1. Out of all data points, select M points randomly (the root set).
    2. Fit the model to these.
    3. Compute the prediction error for all data points.
    4. Add all points for which the error is below the threshold to the consensus set of this iteration.
  2. Fit the model using all points in the largest consensus set.

As a convergence criterion, the probability that at least one outlier-free root set has been chosen can be tracked. Let s be the number of iterations performed so far and r the fraction of inliers among all data points.

p = 1 - (1 - r^M)^s

Because r is not known beforehand, it is estimated by \frac{\text{\# inliers in the best model}}{\text{\# data points}}. The algorithm iterates until p exceeds some value close to one (e.g. p > 0.99).
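The loop and the stopping criterion can be sketched as follows for line fitting (minimal root set M = 2). This is an illustrative standalone implementation, not the library's:

```python
import random

def ransac_line(points, thres, prob=0.99, max_iter=100):
    """Fit a line y = a*x + b to 2-D points, ignoring outliers."""
    best = []
    for s in range(1, max_iter + 1):
        # 1. Select a random root set of minimal size M = 2.
        (x1, y1), (x2, y2) = random.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, no unique line
        # 2. Fit the model to the root set.
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 3./4. Consensus set: points with error below the threshold.
        consensus = [(x, y) for x, y in points if abs(a * x + b - y) <= thres]
        if len(consensus) > len(best):
            best = consensus
        # Estimated inlier ratio and success probability p = 1 - (1 - r^M)^s.
        r = len(best) / len(points)
        if 1 - (1 - r ** 2) ** s > prob:
            break
    # Refit by least squares on the largest consensus set.
    n = len(best)
    mx = sum(x for x, _ in best) / n
    my = sum(y for _, y in best) / n
    a = sum((x - mx) * (y - my) for x, y in best) / \
        sum((x - mx) ** 2 for x, _ in best)
    return a, my - a * mx

random.seed(0)  # for reproducibility of this example
inliers = [(x, 2 * x + 1) for x in range(10)]   # points on y = 2x + 1
a, b = ransac_line(inliers + [(0, 50), (5, -40)], thres=0.5)
```

Despite the two gross outliers, the recovered line matches the inliers.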

5.4. Interfaces

ailib.selection.crossValidation(data, model, K)
  • data (STD) – Data points.
  • model (ailib.fitting.Model) – Model to fit the data points to.
  • K (int) – Number of buckets.

Returns: Cross-validation error.

class ailib.selection.Ransac(thres, minPts, prob=0.99, maxIter=100)
  • thres (float) – Error threshold.
  • minPts (int) – Minimum number of points the model requires to be computed.
  • prob (float) – Minimum probability of finding an outlier-free root set.
  • maxIter (int) – Maximum number of iterations.
fit(data, model)

Returns: Model instance, fitted to the optimal consensus set.