Non-Linear least squares
**************************

.. contents::

.. module:: ailib.fitting

Where not otherwise noted, the main sources are [IMM3215]_ and Wikipedia.

Let :math:`M_\theta(x)` be a function of :math:`x` with parameters :math:`\theta`. Unlike
:ref:`linear-generalized-linear-models`, the function :math:`M` may be non-linear in the
parameters :math:`\theta`, for example

.. math::
    M_\theta(x) = \theta_0 \exp(\theta_1 x) + \theta_2 \exp(\theta_3 x)
    \quad \text{with} \, \theta = [ \theta_0, \theta_1, \theta_2, \theta_3 ]^T

The aim of non-linear least squares is to find the parameters :math:`\theta` such that the
function fits a set of training data (input and output values of :math:`M`) optimally with
respect to the sum of squared residuals:

.. math::
    \theta^* = \text{argmin}_\theta \sum\limits_i (y_i - M_\theta(x_i))^2

The input and output values :math:`x, y` can be considered constant, as the minimization is
over the function parameters :math:`\theta`. To simplify the notation, the functions
:math:`f, F` are defined as

.. math::
    :nowrap:

    \begin{eqnarray*}
        f_i(\theta) & = & y_i - M_\theta(x_i) \\
        F(\theta)   & = & \sum\limits_i (y_i - M_\theta(x_i))^2 = \sum\limits_i f_i(\theta)^2
    \end{eqnarray*}

and the optimization problem can therefore be reformulated as

.. math::
    \theta^* = \text{argmin}_\theta F(\theta)

.. note::
    In contrast to :ref:`linear-generalized-linear-models`, the presented algorithms only
    find local minima. A local minimum of a function :math:`F` is defined by the following
    condition, for some (small) :math:`\delta > 0`:

    .. math::
        F(\theta^*) \leq F(\theta) \quad \forall \, \| \theta - \theta^* \| < \delta

.. _nlsq-gradient-descent:

Gradient descent methods
========================

Descent methods iteratively compute the parameter set :math:`\theta_{t+1}` from a prior set
:math:`\theta_{t}` by

.. math::
    \theta_{t+1} = \theta_{t} + \alpha \vec{h}

with the initial set :math:`\theta_0` given. The moving direction :math:`\vec{h}` and the
step size :math:`\alpha` have to be determined such that each step is actually a descent,
i.e. the following inequality holds for each step :math:`t`:

.. math::
    F(\theta_{t+1}) < F(\theta_{t})

The step size :math:`\alpha` can be set to a constant, e.g. :math:`\alpha = 1`, or determined
by :ref:`nlsq-line-search` (or another method). A constant value may be justified because a
line search can be computationally expensive.

.. _nlsq-steepest-descent:

Steepest descent
----------------

Let :math:`\nabla(\theta)` be the gradient of :math:`F(\theta)`:

.. math::
    \nabla(\theta) := \left[
        \frac{\partial F(\theta) }{ \partial \theta_0 },
        \frac{\partial F(\theta) }{ \partial \theta_1 },
        \dots,
        \frac{\partial F(\theta) }{ \partial \theta_n }
    \right]^T

The gradient :math:`\nabla(\theta)` points in the direction of the steepest increase of the
function :math:`F(\theta)`. Hence, the **steepest descent method**, also called the
**gradient method**, uses its negative as moving direction :math:`\vec{h} = -\nabla(\theta)`:

.. math::
    \lim\limits_{\alpha \rightarrow 0} \frac{ F(\theta) - F(\theta + \alpha \vec{h}) }{ \alpha \| \vec{h} \| }
    = -\frac{1}{\|\vec{h}\|} \vec{h}^T \nabla(\theta)
    = -\frac{1}{\|\vec{h}\|} \| \vec{h} \| \, \| \nabla(\theta) \| \, \cos(\phi)
    = -\| \nabla(\theta) \| \cos(\phi)

with :math:`\phi` the angle between :math:`\nabla(\theta)` and :math:`\vec{h}`. For
:math:`\phi = \pi` the decrease becomes maximal and thus the steepest descent direction is
:math:`-\nabla(\theta)`.
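For concreteness, the following is a minimal sketch of the objective :math:`F(\theta)` and
the steepest descent direction :math:`-\nabla(\theta)` for the two-exponential example model,
assuming NumPy; the gradient is approximated by forward differences, and the step ``eps`` is
an arbitrary illustration value (these helpers are not part of ``ailib.fitting``).

.. code-block:: python

    import numpy as np

    def F(theta, x, y):
        """Sum-of-squares objective for M_theta(x) = t0*exp(t1*x) + t2*exp(t3*x)."""
        t0, t1, t2, t3 = theta
        r = y - (t0 * np.exp(t1 * x) + t2 * np.exp(t3 * x))
        return r @ r

    def steepest_descent_direction(theta, x, y, eps=1e-6):
        """h = -grad F(theta), with the gradient approximated by forward differences."""
        theta = np.asarray(theta, dtype=float)
        grad = np.empty_like(theta)
        for j in range(theta.size):
            step = np.zeros_like(theta)
            step[j] = eps
            grad[j] = (F(theta + step, x, y) - F(theta, x, y)) / eps
        return -grad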
The gradient is a reasonable choice as it is locally optimal: no other direction yields a
faster initial decrease. The method converges, but convergence is only linear; it is
therefore mostly useful while :math:`\theta` is still far from the minimizer
:math:`\theta^*`. The steepest descent algorithm can be summarized as follows:

1. Init :math:`\theta = \theta_0`.
2. Iterate

   1. :math:`\vec{h} = -\nabla(\theta)`.
   2. Compute :math:`\alpha` (using :ref:`nlsq-line-search` or another method).
   3. :math:`\theta = \theta + \alpha \vec{h}`.
   4. Stop, if :math:`\vec{h}` is small enough.

.. _nlsq-newton:

Newton's method
---------------

Let :math:`\mathbf{H}(\theta)` be the Hessian matrix of :math:`F(\theta)`:

.. math::
    \mathbf{H}(\theta) = \left[ \frac{ \partial^2 F(\theta) }{ \partial \theta_i \, \partial \theta_j } \right]

As :math:`\theta^*` is an extremal point, the gradient must be zero there,
:math:`\nabla(\theta^*) = 0`. Thus, :math:`\theta^*` is a stationary point of the gradient.
Using a first-order Taylor expansion of the gradient around :math:`\theta` (equivalently, a
second-order expansion of :math:`F`), one gets

.. math::
    \nabla(\theta + \vec{h}) \approx \nabla(\theta) + \mathbf{H}(\theta) \vec{h}

Setting :math:`\nabla(\theta + \vec{h}) = 0` yields the moving direction of **Newton's
method**:

.. math::
    \mathbf{H}(\theta) \vec{h} = -\nabla(\theta)

The algorithm is defined analogously to the :ref:`nlsq-steepest-descent` algorithm:

1. Init :math:`\theta = \theta_0`.
2. Iterate

   1. Solve :math:`\mathbf{H}(\theta) \vec{h} = -\nabla(\theta)`.
   2. Compute :math:`\alpha` (using :ref:`nlsq-line-search` or another method).
   3. :math:`\theta = \theta + \alpha \vec{h}`.
   4. Stop, if :math:`\vec{h}` is small enough.

Newton's method has up to quadratic convergence but may converge to a maximum. It is
therefore only appropriate if the estimate :math:`\theta` is already close to the minimizer
:math:`\theta^*`. A major drawback of Newton's method is that it requires the second
derivatives, which often cannot be provided. To combine the strengths of both methods, a
**hybrid algorithm** can be defined:

1. Init :math:`\theta = \theta_0`.
2. Iterate

   1. If :math:`\mathbf{H}(\theta)` is positive definite, Newton's method is used:
      :math:`\vec{h} = - \mathbf{H}^{-1}(\theta) \nabla(\theta)`.
   2. Otherwise, the gradient method is applied: :math:`\vec{h} = -\nabla(\theta)`.
   3. Compute :math:`\alpha` (using :ref:`nlsq-line-search` or another method).
   4. :math:`\theta = \theta + \alpha \vec{h}`.
   5. Stop, if :math:`\vec{h}` is small enough.

.. _nlsq-line-search:

Line search
-----------

All of the above algorithms require a step size :math:`\alpha`. Given the current estimate
:math:`\theta` and the moving direction :math:`\vec{h}`, a line search computes either the
optimal step size or an approximation of it. Consider the function

.. math::
    \varphi(\alpha) = F(\theta + \alpha \vec{h})

The goal is to find the step size that minimizes this function. If :math:`\alpha` is too
small, it should be increased in order to speed up convergence. If it is too large, the
error might not decrease (if :math:`\varphi(\alpha)` has a minimum), so it has to be
decreased. A convenient property of this function is that its derivative is easily seen to
be

.. math::
    \varphi'(\alpha) = \vec{h}^T \nabla(\theta + \alpha \vec{h})

The **exact line search** algorithm aims to find a good approximation of the minimum,
whereas the **soft line search** method only ensures that :math:`\alpha` lies within
acceptable bounds.
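As an illustration, here is a minimal sketch of :math:`\varphi` and its derivative; ``F``
and ``grad_F`` are placeholder callables for the objective and its gradient (not part of
``ailib.fitting``), and ``theta`` and ``h`` are assumed to be 1-D NumPy arrays.

.. code-block:: python

    def make_phi(F, grad_F, theta, h):
        """Return phi(alpha) = F(theta + alpha*h) and its derivative.

        F and grad_F are placeholder callables for the objective and its
        gradient; theta and h are 1-D NumPy arrays.
        """
        def phi(alpha):
            return F(theta + alpha * h)

        def dphi(alpha):
            # phi'(alpha) = h^T grad F(theta + alpha*h)
            return h @ grad_F(theta + alpha * h)

        return phi, dphi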
The soft line search algorithm is often preferred, as it is usually faster and a close
approximation of the minimum is not required. The soft line search uses the following two
constraints for :math:`\alpha_s`:

.. math::
    :nowrap:

    \begin{eqnarray*}
        \varphi(\alpha_s)  & \leq & \varphi(0) + \varrho \alpha_s \varphi'(0) \\
        \varphi'(\alpha_s) & \geq & \beta \varphi'(0)
    \end{eqnarray*}

with :math:`\varrho \in (0, 0.25)` and :math:`\beta \in (\varrho, 1)`. The first constraint
ensures that the error decreases sufficiently; the second prevents the step size from
becoming too small. The line search algorithm is based on the idea of constructing an
interval :math:`[a,b]` of acceptable values for :math:`\alpha` and successively narrowing it
down. Let :math:`\lambda(\alpha) := \varphi(0) + \varrho \alpha \varphi'(0)` denote the
right hand side of the first constraint. :math:`\alpha_{\text{max}}` is a parameter that
prevents the right interval boundary :math:`b` from growing arbitrarily large in case
:math:`F` is unbounded.

1. Initialize :math:`a = 0`
2. Initialize :math:`b = \min(1, \alpha_{\text{max}})`
3. Iterate while :math:`\varphi(b) \leq \lambda(b)` and :math:`\varphi'(b) \leq \beta \varphi'(0)` and :math:`b < \alpha_{\text{max}}`

   1. :math:`a = b`
   2. :math:`b = \min(2b, \alpha_{\text{max}})`

4. :math:`\alpha = b`
5. Iterate while :math:`\varphi(\alpha) > \lambda(\alpha)` or :math:`\varphi'(\alpha) < \beta \varphi'(0)`

   1. Refine :math:`\alpha, a, b`

The first iteration shifts the initial interval to the right as long as the second
constraint is not fulfilled. The second iteration shrinks the interval until a value of
:math:`\alpha` is found for which both constraints hold. The refinement routine is defined
as:

1. Initialize :math:`D = b - a`
2. Initialize :math:`c = \frac{ \varphi(b) - \varphi(a) - D \varphi'(a) }{ D^2 }`
3. If :math:`c > 0`, set :math:`\alpha = \min \left( \max \left( a - \frac{ \varphi'(a) }{ 2c }, a + 0.1 D \right), b - 0.1 D \right)`
4. Otherwise, set :math:`\alpha = \frac{ a+b }{ 2 }`
5. If :math:`\varphi(\alpha) < \lambda(\alpha)`, set :math:`a = \alpha`
6. Otherwise, set :math:`b = \alpha`

The first branching adjusts :math:`\alpha` by approximating :math:`\varphi` with a second
order polynomial. If the approximation has a minimum, it is taken; otherwise :math:`\alpha`
is set to the midpoint of the interval :math:`[a,b]`. In the first case, it is ensured that
the interval shrinks by at least 10%. The second branching reduces the interval such that
the new interval still contains acceptable points.

The difference between the soft and the exact line search is the second iteration of the
algorithm. For the exact line search, step (5) becomes

5. Iterate while :math:`| \varphi'(\alpha) | > \tau | \varphi'(0) |` and :math:`b - a > \epsilon`

   1. Refine :math:`\alpha, a, b`

with :math:`\tau, \epsilon > 0` being error tolerance parameters. [IMM3217]_

.. _nlsq-gauss-newton:

Gauss-Newton algorithm
======================

Let :math:`N` be the number of samples and :math:`M` the number of parameters, and define
:math:`\mathbf{f}(\theta) := \left[ f_0(\theta), f_1(\theta), \dots, f_{N-1}(\theta) \right]^T`.
Furthermore, the :math:`N \times M` Jacobian matrix is

.. math::
    \mathbf{J}(\theta)_{ij} = \frac{\partial f_i(\theta) }{ \partial \theta_j}

The **Gauss-Newton algorithm** approximates the function :math:`\mathbf{f}(\theta + \vec{h})`
by a first-order (linear) Taylor expansion
:math:`\mathbf{l}(\vec{h}) := \mathbf{f}(\theta) + \mathbf{J}(\theta) \vec{h}` around
:math:`\theta`.
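Before continuing with the Gauss-Newton step, here is a minimal sketch of the residual
vector :math:`\mathbf{f}(\theta)` and its analytic Jacobian for the two-exponential example
model from the introduction, assuming NumPy; the function names are illustrative and not
part of ``ailib.fitting``.

.. code-block:: python

    import numpy as np

    def residuals(theta, x, y):
        """f_i(theta) = y_i - M_theta(x_i) for M_theta(x) = t0*exp(t1*x) + t2*exp(t3*x)."""
        t0, t1, t2, t3 = theta
        return y - (t0 * np.exp(t1 * x) + t2 * np.exp(t3 * x))

    def jacobian(theta, x):
        """N x 4 Jacobian with entries J_ij = d f_i / d theta_j."""
        t0, t1, t2, t3 = theta
        e1, e3 = np.exp(t1 * x), np.exp(t3 * x)
        # f_i = y_i - M_theta(x_i), hence d f_i / d theta_j = -d M_theta(x_i) / d theta_j
        return -np.column_stack((e1, t0 * x * e1, e3, t2 * x * e3))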
This linearization is then used to formulate the minimization problem
:math:`\text{argmin}_{\vec{h}} \tilde{F}(\theta + \vec{h})` with
:math:`\tilde{F}(\theta + \vec{h}) := \mathbf{l}(\vec{h})^T \mathbf{l}(\vec{h})`, which is
equivalent to solving the linear system
:math:`\left( \mathbf{J}(\theta)^T \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}(\theta)^T \mathbf{f}(\theta)`.
As in :ref:`nlsq-newton`, the rationale is that the minimum of the approximation can be
computed exactly. While Newton's method linearizes the gradient :math:`\nabla`, the
Gauss-Newton algorithm linearizes the residual function :math:`\mathbf{f}` directly.

1. Init :math:`\theta = \theta_0`.
2. Iterate

   1. Solve :math:`\left( \mathbf{J}^T(\theta) \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{f}(\theta)`.
   2. Compute :math:`\alpha` (using :ref:`nlsq-line-search` or another method).
   3. :math:`\theta = \theta + \alpha \vec{h}`.
   4. Stop, if :math:`\vec{h}` is small enough.

The algorithm converges if :math:`\mathbf{J}(\theta)` has full rank in all iteration steps
and if :math:`\theta_0` is close enough to :math:`\theta^*`. This implies that the choice of
the initial parameters influences not only the result (which local minimum is found) but
also convergence. Close to :math:`\theta^*`, convergence is typically superlinear, and
quadratic if the residuals vanish at the solution.

The algorithm can easily be extended to handle sample weights. Let :math:`\mathbf{W}` be the
matrix with the weights on its diagonal, :math:`\mathbf{W}_{ii} = w_i`. Then the linear
system to be solved in iteration step (1) becomes

.. math::
    \left( \mathbf{J}^T(\theta) \mathbf{W} \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{W} \mathbf{f}(\theta)

.. _nlsq-levenberg-marquardt:

Levenberg-Marquardt algorithm
=============================

The Levenberg-Marquardt algorithm introduces a damping parameter :math:`\mu \geq 0` into the
:ref:`nlsq-gauss-newton`. Using
:math:`L(\vec{h}) := \frac{1}{2} \mathbf{l}(\vec{h})^T \mathbf{l}(\vec{h})`, the minimization
problem is reformulated as

.. math::
    \vec{h} = \text{argmin}_{\vec{h}} \left( L(\vec{h}) + \frac{1}{2} \mu \vec{h}^T \vec{h} \right)

and can be computed by solving the linear system

.. math::
    \left( \mathbf{J}(\theta)^T \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}(\theta)^T \mathbf{f}(\theta)

The first equation shows that the damping parameter adds a penalty proportional to the
squared length of the step :math:`\vec{h}`. The step length can therefore be controlled by
:math:`\mu`, which makes the step size parameter :math:`\alpha` obsolete. The damping
parameter :math:`\mu` can intuitively be described by the following two cases:

1. :math:`\mu` is large. The linear system is approximately
   :math:`\vec{h} = -\frac{1}{\mu} \mathbf{J}(\theta)^T \mathbf{f}(\theta)`, thus a short
   step in the :ref:`nlsq-steepest-descent` direction.
2. :math:`\mu` is small (zero in the extreme). The optimization problem becomes close to the
   original Gauss-Newton one. This is desired in the final stage, when :math:`\theta` is
   close to :math:`\theta^*`.

.. _nlsq-gain-ratio:

Like in :ref:`nlsq-line-search`, the damping parameter can be adjusted iteratively. The
**gain ratio** measures how well the actual decrease matches the predicted decrease and is
defined as

.. math::
    \varrho = \frac{ F(\theta) - F(\theta + \vec{h}) }{ L(\vec{0}) - L(\vec{h}) }

where :math:`L(\vec{h})` is the local approximation of :math:`F` around :math:`\theta`
defined above. The denominator can be shown to equal
:math:`\frac{1}{2} \vec{h}^T \left( \mu \vec{h} - \mathbf{J}(\theta)^T \mathbf{f}(\theta) \right)`.
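The following is a minimal sketch of one damped step and its gain ratio, assuming NumPy;
``residual_fn`` and ``jacobian_fn`` are placeholder callables (not part of
``ailib.fitting``) returning :math:`\mathbf{f}(\theta)` and :math:`\mathbf{J}(\theta)`.

.. code-block:: python

    import numpy as np

    def lm_step_and_gain(theta, mu, residual_fn, jacobian_fn):
        """Compute one damped step h and the gain ratio rho at theta (1-D array)."""
        f = residual_fn(theta)
        J = jacobian_fn(theta)
        g = J.T @ f
        # damped normal equations: (J^T J + mu I) h = -J^T f
        h = np.linalg.solve(J.T @ J + mu * np.eye(theta.size), -g)
        f_new = residual_fn(theta + h)
        actual = f @ f - f_new @ f_new        # F(theta) - F(theta + h)
        predicted = 0.5 * h @ (mu * h - g)    # L(0) - L(h)
        return h, actual / predicted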
Using this, the damping parameter is set in each step according to the following routine
(according to Nielsen):

.. _nlsq-nielsen:

1. If :math:`\varrho > 0`

   1. :math:`\mu = \mu \max \left( \frac{1}{3}, 1 - (2 \varrho - 1)^3 \right)`.
   2. :math:`\nu = 2`

2. Otherwise

   1. :math:`\mu = \mu \, \nu`
   2. :math:`\nu = 2 \nu`

with :math:`\nu = 2` initially. The gain ratio is negative if the step :math:`\vec{h}` does
not actually decrease the target function :math:`F`. If the gain ratio is not positive, as
in case (2), the current estimate of :math:`\theta` is not updated. The written-out
**Levenberg-Marquardt algorithm** with adaptive damping parameter looks as follows:

1. Initialize :math:`\nu = 2`
2. Initialize :math:`\theta = \theta_0`
3. Initialize :math:`\mu = \tau \max\limits_i \left[ \mathbf{J}(\theta_0)^T \mathbf{J}(\theta_0) \right]_{ii}`
4. Iterate

   1. Solve :math:`\left( \mathbf{J}^T(\theta) \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{f}(\theta)`.
   2. Compute the gain ratio :math:`\varrho = \frac{ F(\theta) - F(\theta + \vec{h}) }{ L(\vec{0}) - L(\vec{h}) }`.
   3. If the gain ratio is positive, update :math:`\mu, \nu` as in case (1) above and update
      the parameter estimate :math:`\theta = \theta + \vec{h}`.
   4. Otherwise, adjust :math:`\mu, \nu` as in case (2) above.
   5. Stop, if the change in :math:`\theta` or the gradient is small enough.

:math:`\tau` and :math:`\theta_0` are parameters of the algorithm and therefore have to be
provided by the user. The algorithm could also be formulated like the
:ref:`nlsq-gauss-newton` with a constant damping parameter. However, this would do little
more than introduce an additional parameter and is therefore not beneficial. Like in the
:ref:`nlsq-gauss-newton`, the optimization problem can also incorporate sample weights:

.. math::
    \left( \mathbf{J}^T(\theta) \mathbf{W} \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{W} \mathbf{f}(\theta)

..
    Draft notes on Powell's Dog Leg method: combines the Gauss-Newton and the steepest
    descent step, controlled via a trust region of radius delta.

        J h        ~= -f
        (J^T J) h_gn = -J^T f                       (Gauss-Newton step)
        h_sd = -g = -J^T f                          (steepest descent step)
        alpha = norm(g)**2 / norm(J g)**2
        alpha = h_sd^T J^T f / norm(J h_sd)**2      (choose one of these)

        if norm(h_gn) <= delta:
            h_dl = h_gn
        elif norm(alpha * h_sd) >= delta:
            h_dl = (delta / norm(h_sd)) * h_sd
        else:
            h_dl = alpha * h_sd + beta * (h_gn - alpha * h_sd)
            with beta such that norm(h_dl) = delta

    Draft notes on trust region and damped methods (trust region method not used here,
    can probably be left out):

        F(x + h) ~= L(h) := F(x) + h^T c + 1/2 h^T B h
        with c in R^n and B in R^{nxn} symmetric
        (second order Taylor expansion around x)

        Trust region method, with positive delta:
            h = h_tr = argmin_{||h|| <= delta} L(h)
        assuming the model is sufficiently accurate inside a ball of radius delta
        centered at x.

        Updating the trust region radius:
            if sigma < 0.25:
                delta = delta / 2                   (p1 = 2)
            elif sigma > 0.75:
                delta = max(delta, 3 * norm(h))     (p2 = 3)
        p1, p2 are chosen such that delta does not oscillate.

Interfaces
==========

.. autoclass:: NLSQ
    :members:
    :inherited-members:

.. autoclass:: SteepestDescent
    :members:
    :show-inheritance:

.. autoclass:: Newton
    :members:
    :show-inheritance:

.. autoclass:: HybridSDN
    :members:
    :show-inheritance:

.. autoclass:: GaussNewton
    :members:
    :show-inheritance:

.. autoclass:: Marquardt
    :members:
    :show-inheritance:

Examples
========
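The constructor signatures of the classes above are not documented in this section, so the
following standalone sketch (assuming NumPy; all names are illustrative and not part of
``ailib.fitting``) implements the Levenberg-Marquardt iteration described above for the
two-exponential example model:

.. code-block:: python

    import numpy as np

    def model(theta, x):
        """M_theta(x) = theta_0 * exp(theta_1 * x) + theta_2 * exp(theta_3 * x)."""
        t0, t1, t2, t3 = theta
        return t0 * np.exp(t1 * x) + t2 * np.exp(t3 * x)

    def residuals(theta, x, y):
        return y - model(theta, x)

    def jacobian(theta, x):
        t0, t1, t2, t3 = theta
        e1, e3 = np.exp(t1 * x), np.exp(t3 * x)
        return -np.column_stack((e1, t0 * x * e1, e3, t2 * x * e3))

    def marquardt(x, y, theta0, tau=1e-3, max_iter=200, tol=1e-10):
        """Levenberg-Marquardt with Nielsen's damping update, as described above."""
        theta = np.asarray(theta0, dtype=float)
        f = residuals(theta, x, y)
        J = jacobian(theta, x)
        A, g = J.T @ J, J.T @ f
        mu, nu = tau * np.max(np.diag(A)), 2.0
        for _ in range(max_iter):
            if np.linalg.norm(g, np.inf) < tol:          # gradient small enough
                break
            h = np.linalg.solve(A + mu * np.eye(theta.size), -g)
            if np.linalg.norm(h) < tol * (np.linalg.norm(theta) + tol):
                break                                    # change in theta small enough
            theta_new = theta + h
            f_new = residuals(theta_new, x, y)
            # gain ratio: actual vs. predicted decrease of the sum of squares
            # (numerator and denominator both drop the factor 1/2, which cancels)
            rho = (f @ f - f_new @ f_new) / (h @ (mu * h - g))
            if rho > 0:                                  # step accepted
                theta, f = theta_new, f_new
                J = jacobian(theta, x)
                A, g = J.T @ J, J.T @ f
                mu *= max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
                nu = 2.0
            else:                                        # step rejected: increase damping
                mu *= nu
                nu *= 2.0
        return theta

    # Hypothetical usage: fit the model to noisy samples.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = model([2.0, -1.0, 1.0, -3.0], x) + 0.01 * rng.standard_normal(x.size)
    print(marquardt(x, y, theta0=[1.5, -0.8, 0.8, -2.5]))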