Linear Regression
*****************

.. contents::

.. module:: ailib.fitting

.. _linear-generalized-linear-models:

Generalized linear models
=========================

.. TODO: linear in parameters

Given an input vector :math:`\vec{x} = \left[ x_1, x_2, \dots, x_p \right]^T` and coefficients
:math:`\vec{\beta} = \left[ \beta_0, \beta_1, \beta_2, \dots, \beta_p \right]^T`, a generalized
linear model is a function of the form:

.. math::

   f(\vec{x}) = \beta_0 + \sum\limits_{j=1}^p x_j \beta_j

This notation describes any function with a linear dependence on the parameters
:math:`\beta_j`. For example, a polynomial can be written as

.. math::

   f(\vec{x}) = \beta_0 + \sum\limits_{j=1}^p x_j \beta_j
   \quad \text{with } \vec{x} = \left[ x, x^2, \dots, x^p \right]

The fitting problem is to find the *optimal* parameters :math:`\beta_j` of the function
:math:`f(\vec{x})` for :math:`N` given training pairs :math:`(\vec{x}_i, \, y_i)`, with
:math:`y_i` the target output for input :math:`\vec{x}_i`. In this context, optimal means
that the choice of :math:`\vec{\beta}` has to minimize the (weighted) sum of squared
residuals:

.. math::
   :nowrap:

   \begin{eqnarray*}
      \text{argmin}_\beta \, \text{RSS}(\beta)
         & = & \text{argmin}_\beta \, \sum\limits_{i=1}^N w_i \left( y_i - f(\vec{x}_i) \right)^2 \\
         & = & \text{argmin}_\beta \, \left( \vec{y} - \mathbf{X} \vec{\beta} \right)^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right)
   \end{eqnarray*}

The second line merely restates the sum in matrix form. The matrix :math:`\mathbf{X}` holds
all training input vectors (size :math:`N \times (p+1)`), and :math:`\mathbf{W}` is a diagonal
matrix with :math:`W_{ii} = w_i`.

As known from basic calculus, extremal points (here the minimum) are found by setting the
derivative of the function to zero. Since the training samples are fixed and the optimization
is over the parameters, the derivative of interest is the one with respect to the parameters
:math:`\vec{\beta}`:

.. math::

   \frac{ \partial \text{RSS}(\beta) }{ \partial \beta } = 0

The nice property of generalized linear models is that this derivative is easily determined:

.. math::

   \frac{ \partial \text{RSS}(\beta) }{ \partial \beta }
      = -2 \mathbf{X}^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right)

Putting the above ideas together yields the closed-form solution of the fitting problem:

.. math::
   :nowrap:

   \begin{eqnarray*}
      \frac{ \partial \text{RSS}(\beta) }{ \partial \beta } & = & 0 \\
      \mathbf{X}^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right) & = & 0 \\
      \mathbf{X}^T \mathbf{W} \vec{y} - \mathbf{X}^T \mathbf{W} \mathbf{X} \vec{\beta} & = & 0 \\
      \mathbf{X}^T \mathbf{W} \vec{y} & = & \mathbf{X}^T \mathbf{W} \mathbf{X} \vec{\beta} \\
      \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \vec{y} & = & \vec{\beta}
   \end{eqnarray*}

The generalization to :math:`K` outputs is obtained by writing :math:`\vec{y}` as an
(:math:`N \times K`) matrix with each row holding one set of output values. The coefficients
are then computed exactly as above, :math:`\vec{\beta}` becoming a (:math:`(p+1) \times K`)
matrix. Note that the outputs are independent of each other, so solving the system with
multiple outputs is equivalent to solving it once for each single output. [ESLII]_
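As an aside, the closed-form solution above translates directly into a few lines of code.
The following is a minimal sketch only, assuming NumPy is available;
``weighted_least_squares`` is an illustrative helper and not part of the
:mod:`ailib.fitting` interface.

.. code-block:: python

   import numpy as np

   def weighted_least_squares(X, y, w=None):
       """Closed-form solution beta = (X^T W X)^{-1} X^T W y.

       X is the (N x (p+1)) design matrix (first column all ones for the
       intercept), y the (N,) target vector and w an optional (N,) vector
       of sample weights.
       """
       N = X.shape[0]
       W = np.diag(np.ones(N) if w is None else np.asarray(w, dtype=float))
       # Solve (X^T W X) beta = X^T W y rather than inverting the matrix explicitly.
       return np.linalg.solve(X.T.dot(W).dot(X), X.T.dot(W).dot(y))

   # Fit a line y = beta_0 + beta_1 * x to noisy samples of y = 1 + 2 x.
   x = np.linspace(0.0, 1.0, 50)
   X = np.column_stack([np.ones_like(x), x])
   y = 1.0 + 2.0 * x + 0.01 * np.random.randn(x.size)
   beta = weighted_least_squares(X, y)   # close to [1.0, 2.0]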
Ridge regression
----------------

The least squares routine described above may lead to severe :term:`overfitting` problems.
The :term:`generalization error` might even decrease if some of the coefficients are set to
zero. Instead of removing a subset of coefficients, ridge regression introduces a penalty on
the size of the coefficients: the sum of squared coefficients must not exceed a threshold.
This drives less significant coefficients towards zero, while the most influential ones are
changed only slightly. Using Lagrange multipliers, the optimization problem is stated as
follows:

.. math::

   \hat{\beta}^{\text{ridge}} = \text{argmin}_\beta \,
      \left( \sum\limits_{i=1}^N w_i \left( y_i - f_\beta(x_i) \right)^2
         + \lambda \sum\limits_{j=1}^p \beta_j^2 \right)

The parameter :math:`\lambda \geq 0` controls the amount of shrinkage: the larger
:math:`\lambda`, the more the coefficients are constrained. For :math:`\lambda = 0`, the
problem becomes identical to :ref:`ordinary least squares <linear-generalized-linear-models>`.
As for ordinary least squares, the sum of squared residuals can be written in matrix form,

.. math::

   \text{RSS}(\lambda) = (\vec{y} - \mathbf{X} \vec{\beta})^T \mathbf{W} (\vec{y} - \mathbf{X} \vec{\beta})
      + \lambda \vec{\beta}^T \vec{\beta}

and by setting its derivative with respect to :math:`\vec{\beta}` to zero, the closed-form
solution is found:

.. math::

   \hat{\beta}^{\text{ridge}} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} + \lambda \mathbf{I} \right)^{-1}
      \mathbf{X}^T \mathbf{W} \vec{y}

Interfaces
==========

.. autoclass:: LinearModel
   :members:

.. autoclass:: BasisRegression
   :members:

.. autoclass:: PolynomialRegression
   :members:

Examples
========

Let's define a toy function that generates some samples and fits a polynomial to them.

.. literalinclude:: ../../examples/linear.py
   :language: python
   :pyobject: polyEst

>>> print est(0.0, 1.0, 1000, 1.0, 5)
Poly(1.06417448201 * x**0 + 1.90233085087 * x**1 + -1.80439247299 * x**2 + 25.2719255838 * x**3 + -38.4765407136 * x**4 + 19.0506853189 * x**5)
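To complement the polynomial example, here is a minimal sketch of the ridge closed-form
solution derived above. It is only illustrative and assumes NumPy; ``ridge_regression`` is a
hypothetical helper, not the :class:`BasisRegression` implementation.

.. code-block:: python

   import numpy as np

   def ridge_regression(X, y, lam, w=None):
       """Closed-form ridge solution beta = (X^T W X + lambda I)^{-1} X^T W y."""
       N, p1 = X.shape
       W = np.diag(np.ones(N) if w is None else np.asarray(w, dtype=float))
       A = X.T.dot(W).dot(X) + lam * np.eye(p1)
       b = X.T.dot(W).dot(y)
       return np.linalg.solve(A, b)

   # A degree-5 polynomial basis; the larger lam, the more the coefficients shrink.
   x = np.linspace(0.0, 1.0, 50)
   X = np.column_stack([x ** j for j in range(6)])
   y = np.sin(2.0 * np.pi * x) + 0.1 * np.random.randn(x.size)
   beta_ols   = ridge_regression(X, y, lam=0.0)   # identical to ordinary least squares
   beta_ridge = ridge_regression(X, y, lam=1.0)   # shrunken coefficients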