Glossary
********

.. glossary::
   :sorted:

   0-1 loss function
      The cost of a misclassification is one, the cost of a correct
      classification zero::

         if label == prediction:
             cost = 0.0
         else:
             cost = 1.0

      In mathematical formulations, this can also be written as:

      .. math::

         I \{ y_i \neq c(x_i) \}

   STD
      Supervised training data: [(:term:`feature`, :term:`label`)]

   Training set
      The training set is the collection of data which is used to train
      (fit) a model. For supervised learning, the training set is a list
      of tuples (the :term:`feature` and the :term:`label`). In the case
      of unsupervised learning, the training set is simply the list of
      :term:`feature`\ s. (See also :term:`STD` and :term:`UTD`.)

   Testing set
      In order to measure the :term:`testing error`, a set of data is
      needed that was not already seen by the model, i.e. not used to
      train (fit) the model. Hence, the part of the available data which
      is reserved for this task is called the testing set. Testing data
      is only available in the case of supervised learning (though there
      may be exceptions). The format is the same as for :term:`STD`.

   Generalization error
      The generalization error expresses how well a trained model
      behaves on any possible (so far unseen) input. Loosely speaking,
      it shows how well the model represents the true underlying
      problem. In general, it can neither be measured nor computed, only
      estimated.

   Bootstrapped
      See :term:`Bootstrapping`.

   Bootstrapping
      Given N samples, the bootstrapped dataset contains N points
      randomly drawn with replacement from the original dataset.

   Jacobian matrix
      Given a function :math:`f` with parameters :math:`x_i` and input
      values :math:`a_j`, the Jacobian matrix holds the first-order
      partial derivatives w.r.t. every parameter, evaluated at each
      input value:
      .. math::

         \mathbf{J}(a)_{ij} = \left[ \, \frac{\partial f}{\partial x_i}(a_j) \, \right]
         = \begin{pmatrix}
            \frac{\partial f(a_0)}{\partial x_0} & \frac{\partial f(a_0)}{\partial x_1} & \dots & \frac{\partial f(a_0)}{\partial x_M} \\
            \frac{\partial f(a_1)}{\partial x_0} & \frac{\partial f(a_1)}{\partial x_1} & \dots & \frac{\partial f(a_1)}{\partial x_M} \\
            \vdots & & & \\
            \frac{\partial f(a_N)}{\partial x_0} & \frac{\partial f(a_N)}{\partial x_1} & \dots & \frac{\partial f(a_N)}{\partial x_M}
         \end{pmatrix}

   Gradient
      Given a function :math:`F(\vec{x})`, where :math:`\vec{x}` is a
      vector of variables :math:`x_i`, the gradient is the vector of all
      first-order partial derivatives w.r.t. each component of
      :math:`\vec{x}`:

      .. math::

         \nabla F(\vec{x}) := \left[ \frac{\partial F}{\partial x_0}(\vec{x}),
         \frac{\partial F}{\partial x_1}(\vec{x}), \dots,
         \frac{\partial F}{\partial x_n}(\vec{x}) \right]^T

   Hessian matrix
      The matrix of all second-order partial derivatives of a function
      :math:`f`:

      .. math::

         H(\vec{x}) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j}(\vec{x}) \right]_{ij}

   Overfitting
      In many learning problems, one desires to come up with a model
      that minimizes the prediction error on a training set.
      Unfortunately, models with higher complexity often decrease the
      training error but rely too much on the concrete data set. As an
      example, consider polynomial fitting. A high-order polynomial fits
      any training set well but also includes all noise from the data in
      the model. If the data (say 100 data points) is noisily
      distributed along a parabola, a polynomial of high order (e.g.
      100) will give a low training error but is not capable of
      capturing the actually relevant information (the shape of the
      parabola). The term overfitting describes a situation where the
      complexity of the chosen model does not match the complexity of
      the actual underlying problem, resulting in a trained model that
      only works for the training data.
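      An extreme form of this effect can be sketched in plain Python: a
      hypothetical "model" that simply memorizes the training set
      reaches zero training error, yet is undefined for any unseen
      input (the data below is made up for illustration):

```python
# A "model" that memorizes every training point -- extreme overfitting.
train = [(0.0, 0.1), (1.0, 0.9), (2.0, 4.2), (3.0, 8.8)]  # noisy y ~ x**2

model = dict(train)  # lookup table: feature -> label

# The (mean squared) training error is exactly zero: every training
# point is reproduced perfectly.
train_error = sum((model[x] - y) ** 2 for x, y in train) / len(train)

# But the model captured the data points, not the parabola: it has no
# answer at all for an unseen input such as x = 1.5.
print(train_error)   # 0.0
print(1.5 in model)  # False
```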
      In general, such situations cannot be detected automatically (as
      the training error is actually minimal). Typically, the training
      error will be very low while the testing and generalization
      errors are high.

   i.i.d
      Independent and identically distributed

   Maximum likelihood
      Given a model with some - so far unspecified - parameters, the
      maximum likelihood estimation denotes a method to determine the
      optimal parameter values from training data.

   Expectation-Maximization
      This term refers to an iterative algorithm scheme consisting of
      two steps. It is applied in maximum likelihood methods where some
      information is missing (hidden). In the E-step, the hidden states
      are estimated according to the currently computed model. Then, in
      the M-step, the model is recomputed using the estimated hidden
      states. These two steps are repeated until the procedure
      converges.

   Model
      The core of each learning problem is the model. Given some
      training data and a learning goal, the model is a mathematical
      representation that captures the shape or structure of the
      underlying problem. The model can usually be trained, i.e. fitted
      to the training data, and produces an output that can be mapped
      to the learning goal.

   Training error
      The training error identifies the error statistics of a trained
      model over all data points in the training set.

   Testing error
      The testing error identifies the error statistics of a trained
      model over all data points in the testing set. Usually, the
      testing set is a subset of the originally measured data, not used
      for training. As the testing set is limited, the testing error
      cannot equal the generalization error; yet, it is used to give a
      hint of how well the trained model will behave on new data.

   Label
      In supervised learning situations, the label specifies the
      optimal model outcome. In classification problems, the label is a
      class identifier; in regression problems, it is the target value
      (in the function's range).
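      As a small sketch in plain Python (with made-up data), the two
      kinds of labels look like this:

```python
# Hypothetical supervised training data: (feature, label) pairs.
# In classification, the label is a class identifier ...
clf_data = [([1.2, 0.7], "cat"), ([0.1, 3.4], "dog")]

# ... while in regression, the label is a target value in the
# function's range.
reg_data = [([1.0], 0.84), ([2.0], 0.91)]

features, labels = zip(*clf_data)
print(labels)  # ('cat', 'dog')
```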
   UTD
      Unsupervised training data: [:term:`feature`]

   Regression
      Regression problems deal with adapting a continuous function to
      measured training data. Often, the target function is given in a
      general form, including parameters. The goal of regression is to
      determine parameter values such that the training data is
      optimally represented by the function.

   Classification
      In classification problems, an input feature has to be assigned
      to a certain class. Often, classes represent some attributes or
      characteristics of the possible input, i.e. they divide the input
      space into discrete regions. Classification can also be seen as a
      coding scheme, where an input string has to be assigned to one
      codebook entry.

   Feature
      Description of a data point, represented by a vector
      :math:`[x_0, x_1, \dots, x_n]`.

   Data point
      Training data point, usually represented as a :term:`feature`.
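      Tying the data-related terms together, a minimal plain-Python
      sketch (all values are made up for illustration):

```python
# A data point is described by a feature vector [x0, x1, ..., xn].
point = [5.1, 3.5, 1.4, 0.2]  # four measured attributes of one sample

# Unsupervised training data (UTD) is simply a list of such features ...
utd = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]

# ... whereas supervised training data (STD) pairs each feature with a label.
std = [(f, cls) for f, cls in zip(utd, [0, 0, 1])]

print(len(std) == len(utd))  # True
```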