The cost of a misclassification is one, the cost of a correct classification is zero:
def zero_one_loss(label, prediction):
    """Return 0.0 for a correct classification, 1.0 for a misclassification."""
    if label == prediction:
        cost = 0.0
    else:
        cost = 1.0
    return cost
In mathematical formulations, this can also be written as:

c(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}

where y is the label and \hat{y} the prediction.

Given a function F(\vec{x}), with \vec{x} being a vector of variables (x_0, x_1, \dots, x_n),
the gradient is the vector of all first order partial derivatives w.r.t. each component of \vec{x}:
![\nabla(\vec{x}) := \left[ \frac{\partial F }{ \partial x_0 }(\vec{x}), \frac{\partial F }{ \partial x_1 }(\vec{x}),
\dots, \frac{\partial F }{ \partial x_n }(\vec{x}) \right]^T](_images/math/c6afe3d05fdb402dbe1fd705478e8863861f1cf6.png)
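As an illustration of this definition, here is a minimal sketch that approximates the gradient numerically with central finite differences; the function `f`, the point `x`, and the step size `eps` are assumptions made for the example, not part of the text above:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        # (F(x + eps*e_i) - F(x - eps*e_i)) / (2*eps) approximates dF/dx_i
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

# Example: F(x) = x_0**2 + 3*x_1 has gradient [2*x_0, 3]
print(numerical_gradient(lambda x: x[0]**2 + 3*x[1], [1.0, 2.0]))  # ~[2., 3.]
```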
The Hessian is the matrix of all second order partial derivatives of a function f w.r.t. the components of \vec{x}:

![H(\vec{x}) = \left[ \frac{ \partial^2 f }{ \partial x_i \partial x_j }(\vec{x}) \right]_{ij}](_images/math/4159d3e63aaa9398e94f291ee19aca98cd67b1f7.png)
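In the same spirit, a second order finite difference sketch can serve as a rough numerical check of the Hessian; again, `f`, `x`, and `eps` are illustrative assumptions:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            # Mixed second order difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps**2)
    return H

# Example: f(x) = x_0**2 * x_1 has Hessian [[2*x_1, 2*x_0], [2*x_0, 0]]
print(numerical_hessian(lambda x: x[0]**2 * x[1], [1.0, 2.0]))
```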
Given a function f with parameters x_0, \dots, x_M and input values a_0, \dots, a_N, the Jacobian matrix holds the first order partial derivatives w.r.t. every parameter, evaluated at each input value:
![\mathbf{J}(a)_{ij} = \left[ \, \frac{\partial f }{ \partial x_i}(a_j) \, \right] = \begin{pmatrix}
\frac{ \partial f(a_0) }{ \partial x_0 } & \frac{ \partial f(a_0) }{ \partial x_1 } & \dots & \frac{ \partial f(a_0) }{ \partial x_M } \\
\frac{ \partial f(a_1) }{ \partial x_0 } & \frac{ \partial f(a_1) }{ \partial x_1 } & \dots & \frac{ \partial f(a_1) }{ \partial x_M } \\
\dots & & & \\
\frac{ \partial f(a_N) }{ \partial x_0 } & \frac{ \partial f(a_N) }{ \partial x_1 } & \dots & \frac{ \partial f(a_N) }{ \partial x_M }
\end{pmatrix}](_images/math/60faea0a6ab5c2a22338dd747bb6b9928f11b24d.png)
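A corresponding sketch for the Jacobian above, where the hypothetical `model(params, a)` returns a scalar output for a single input value `a`; the model, parameters, and input values are assumptions for illustration:

```python
import numpy as np

def numerical_jacobian(f, params, inputs, eps=1e-6):
    """Approximate J[i, j] = d f(inputs[i]; params) / d params[j]."""
    params = np.asarray(params, dtype=float)
    inputs = np.asarray(inputs, dtype=float)
    J = np.zeros((inputs.size, params.size))
    for j in range(params.size):
        step = np.zeros_like(params)
        step[j] = eps
        for i, a in enumerate(inputs):
            # Central finite difference w.r.t. parameter j at input value a_i
            J[i, j] = (f(params + step, a) - f(params - step, a)) / (2.0 * eps)
    return J

# Example model: f(x, a) = x_0 + x_1 * a, so row i of J is [1, a_i]
model = lambda x, a: x[0] + x[1] * a
print(numerical_jacobian(model, [0.5, 2.0], [1.0, 2.0, 3.0]))
```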
In many learning problems, the goal is to find a model that minimizes the prediction error on a training set. Unfortunately, models of higher complexity often decrease the training error but depend too heavily on the particular data set.
As an example, consider polynomial fitting. A high order polynomial fits any training set well but also incorporates all the noise from the data into the model. If the data (say, 100 data points) is noisily distributed along a parabola, a polynomial of high order (e.g. 100) will give a low training error but is not capable of capturing the actually relevant information (the shape of the parabola).
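A minimal sketch of this effect, assuming noisy samples from a parabola and two illustrative polynomial degrees (the data, noise level, and degrees are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 100)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # noisy parabola

for degree in (2, 15):
    coeffs = np.polyfit(x, y, deg=degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # The high-degree fit lowers the training error by modelling the noise,
    # not the underlying parabola.
    print(degree, train_error)
```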
The term overfitting describes a situation where the complexity of the chosen model does not match the complexity of the actual underlying problem, resulting in a trained model that only works for the training data. In general, such situations cannot be detected automatically (as the training error is actually minimal). Typically, the training error will be very low while the generalization error is high.