The cost of a misclassification is one, the cost of a correct classification is zero:
if label == prediction:
    cost = 0.0
else:
    cost = 1.0
In mathematical formulations, this can also be written as:

$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
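The zero-one cost above can be sketched as a small function and averaged over a data set; the names (`zero_one_loss`, `labels`, `predictions`) are illustrative, not from the text:

```python
def zero_one_loss(label, prediction):
    """Return 0.0 for a correct classification, 1.0 for a misclassification."""
    return 0.0 if label == prediction else 1.0

# Hypothetical example data.
labels = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 0]

# The average cost over the data set is the error rate.
mean_cost = sum(zero_one_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)
```

Averaging this cost over a data set yields the fraction of misclassified examples.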
Given a function $f(\mathbf{x})$, with $\mathbf{x} = (x_1, \dots, x_n)$ being a vector of variables, the gradient is the vector of all first-order partial derivatives w.r.t. each component of $\mathbf{x}$:

$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$$
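The gradient can be approximated numerically with central finite differences; this is a sketch, with `f`, `x`, and the step size `eps` as illustrative choices:

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        # Partial derivative w.r.t. the i-th component of x.
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# For f(x) = x0^2 + 3*x1, the gradient is (2*x0, 3).
f = lambda x: x[0] ** 2 + 3 * x[1]
g = numerical_gradient(f, [2.0, 1.0])
```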
Given a function $f(x; \boldsymbol{\theta})$ with parameters $\boldsymbol{\theta} = (\theta_1, \dots, \theta_m)$ and input values $x_1, \dots, x_n$, the Jacobian matrix holds the first-order partial derivatives w.r.t. every parameter, evaluated at each input value:

$$J_{ij} = \frac{\partial f(x_i; \boldsymbol{\theta})}{\partial \theta_j}, \quad i = 1, \dots, n, \; j = 1, \dots, m$$
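The same finite-difference idea gives a numerical Jacobian of a parameterized model, one row per input value and one column per parameter; `model`, `theta`, and `xs` are illustrative names:

```python
def jacobian(model, theta, xs, eps=1e-6):
    """J[i][j] = d model(xs[i]; theta) / d theta[j], via central differences."""
    J = []
    for x in xs:
        row = []
        for j in range(len(theta)):
            t_plus = list(theta)
            t_plus[j] += eps
            t_minus = list(theta)
            t_minus[j] -= eps
            row.append((model(x, t_plus) - model(x, t_minus)) / (2 * eps))
        J.append(row)
    return J

# Linear model f(x; a, b) = a*x + b: the derivative w.r.t. a is x, w.r.t. b is 1.
model = lambda x, th: th[0] * x + th[1]
J = jacobian(model, [2.0, 0.5], [1.0, 3.0])
```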
In many learning problems, one wants to come up with a model that minimizes the prediction error on a training set. Unfortunately, models of higher complexity often decrease the training error but rely too heavily on the particular data set.
As an example, consider polynomial fitting. A high-order polynomial fits any training set well but also absorbs all the noise in the data. If the data (say, 100 data points) are noisily distributed along a parabola, a polynomial of high order (e.g. 100) will give a low training error but is not capable of capturing the actually relevant information (the shape of the parabola).
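The polynomial example can be reproduced in a few lines, assuming NumPy is available; the degrees and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
# Noisy samples along a parabola.
y = x ** 2 + rng.normal(scale=0.1, size=x.shape)

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

err_low = train_error(2)    # matches the true complexity of the data
err_high = train_error(15)  # overfits: lower training error, fits the noise
```

Since the degree-2 polynomials are nested inside the degree-15 family, the higher-order least-squares fit always reaches a training error at least as low, even though it generalizes worse.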
The term overfitting describes a situation where the complexity of the chosen model does not match the complexity of the actual underlying problem, resulting in a trained model that only works for the training data. In general, such situations cannot be detected automatically (as the training error is actually minimal). Typically the training error will be very low while the generalization (test) error is high.