4.3.2. Non-Linear least squares

Where not otherwise denoted, the main sources are [IMM3215] and Wikipedia.

Let M_\theta(x) be a function of x with parameters \theta. Unlike Generalized linear models, the function M may be nonlinear in the parameters \theta, for example

M_\theta(x) = \theta_0 \exp(\theta_1 x) + \theta_2 \exp(\theta_3 x)
\quad \text{with} \, \theta = [ \theta_0, \theta_1, \theta_2, \theta_3 ]^T

The aim of non-linear least squares is to find the parameters \theta such that the function fits a set of training data (input and output values of M) optimally with respect to the sum of squared residuals.

\theta^* = \text{argmin}_\theta \sum\limits_i (y_i - M_\theta(x_i))^2

The input and output values x, y can be considered constant, as the minimization is over the function parameters \theta. To simplify the notation, the functions f and F are defined as:

\begin{eqnarray*}
f_i(\theta) & = & y_i - M_\theta(x_i) \\
F(\theta) & = & \sum\limits_i (y_i - M_\theta(x_i))^2 = \sum\limits_i f_i(\theta)^2
\end{eqnarray*}

and the optimization problem can therefore be reformulated:

\theta^* = \text{argmin}_\theta F(\theta)
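
To make the setting concrete, here is a minimal sketch (not part of ailib) of the example model from above and its sum-of-squares objective, assuming NumPy arrays for x, y and theta:

import numpy as np

def model(theta, x):
    # M_theta(x) = theta0 * exp(theta1 * x) + theta2 * exp(theta3 * x)
    return theta[0] * np.exp(theta[1] * x) + theta[2] * np.exp(theta[3] * x)

def residuals(theta, x, y):
    # f_i(theta) = y_i - M_theta(x_i)
    return y - model(theta, x)

def objective(theta, x, y):
    # F(theta) = sum_i f_i(theta)^2
    r = residuals(theta, x, y)
    return float(r @ r)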

Note

In contrast to Generalized linear models, the presented algorithms only find local minima. A local minimum \theta^* of a function F is defined by the following condition, given a (small) radius \delta:

F(\theta^*) \leq F(\theta) \quad \forall \, \| \theta - \theta^* \| < \delta

4.3.2.1. Gradient descent methods

Descent methods iteratively compute the set \theta_{t+1} from a prior set \theta_{t} by

\theta_{t+1} = \theta_{t} + \alpha \vec{h}

with the initial set \theta_0 given. The moving direction \vec{h} and the step size \alpha have to be determined in such a way that each step is actually a descent, i.e. the following inequality holds for every step t:

F(\theta_{t+1}) < F(\theta_{t})

The step size \alpha can be set to a constant, e.g. \alpha = 1, or determined by Line search (or another method). A constant value can be justified when the line search takes too much time.

4.3.2.1.1. Steepest descent

Let \nabla(\theta) be the gradient of F(\theta).

\nabla(\theta) := \left[ \frac{\partial F(\theta) }{ \partial \theta_0 }, \frac{\partial F(\theta) }{ \partial \theta_1 },
\dots, \frac{\partial F(\theta) }{ \partial \theta_n } \right]^T

The gradient \nabla(\theta) can be seen as the direction of the steepest increase of the function F(\theta). Hence, the steepest descent method (also called the gradient method) uses its negative as the moving direction, \vec{h} = -\nabla(\theta).

The negative gradient is locally the best choice, as the descent along it is maximal. The method converges, but only linearly; it is therefore mainly useful while \theta is still far from the minimum \theta^*.

The steepest descent algorithm can be summarized as follows:

  1. Init \theta = \theta_0.

  2. Iterate
    1. \vec{h} = -\nabla(\theta).
    2. Compute \alpha (using Line search or another method).
    3. \theta = \theta + \alpha \vec{h}.
    4. Stop, if \vec{h} is small enough.
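
As an illustration (not the ailib implementation), the loop above could be sketched in a few lines, assuming a user-supplied gradient function and a constant step size instead of a line search:

import numpy as np

def steepest_descent(grad, theta0, alpha=1e-2, eps=1e-6, max_iter=1000):
    # Minimize F by repeatedly stepping along the negative gradient.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        h = -grad(theta)              # moving direction: negative gradient
        theta = theta + alpha * h     # constant step size (no line search)
        if np.linalg.norm(h) < eps:   # stop if the direction is small enough
            break
    return theta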

4.3.2.1.2. Newton’s method

Let H(\theta) be the Hessian matrix of F(\theta)

\mathbf{H}(\theta) = \left[ \frac{ \partial^2 F(\theta) }{ \partial \theta_i \, \partial \theta_j } \right]

As \theta^* is an extremal point, the gradient must vanish there, \nabla(\theta^*) = 0, i.e. \theta^* is a stationary point of F. Using a first order Taylor expansion of the gradient around \theta (corresponding to a second order expansion of F), one gets the equation

\nabla(\theta + \vec{h}) \approx \nabla(\theta) + \mathbf{H}(\theta) \vec{h}

Setting \nabla(\theta + \vec{h}) = 0 yields the moving direction of Newton’s method:

\mathbf{H}(\theta) \vec{h} = -\nabla(\theta)

The algorithm can be defined analogous to the Steepest descent algorithm.

  1. Init \theta = \theta_0.

  2. Iterate
    1. Solve \mathbf{H}(\theta) \vec{h} = -\nabla(\theta).
    2. Compute \alpha (using Line search or another method).
    3. \theta = \theta + \alpha \vec{h}.
    4. Stop, if \vec{h} is small enough.
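
A corresponding sketch of the Newton iteration (again only an illustration, assuming that gradient and Hessian functions are supplied by the user):

import numpy as np

def newton(grad, hess, theta0, alpha=1.0, eps=1e-8, max_iter=100):
    # Minimize F by solving H(theta) h = -grad(theta) in each step.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        h = np.linalg.solve(hess(theta), -grad(theta))  # Newton direction
        theta = theta + alpha * h
        if np.linalg.norm(h) < eps:
            break
    return theta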

Newton’s method has up to quadratic convergence but may converge to a maximum. Thus it is only appropriate if the estimate \theta is already close to the minimizer \theta^*. A major drawback of Newton’s method is that the second derivatives are required, a prerequisite that often cannot be fulfilled.

To combine the strengths of both methods, a hybrid algorithm can be defined

  1. Init \theta = \theta_0.

  2. Iterate
    1. If H(\theta) is positive definite, Newton’s method is used: \vec{h} = - \mathbf{H}^{-1}(\theta) \nabla(\theta).
    2. Otherwise, the gradient method is applied: \vec{h} = -\nabla(\theta).
    3. Compute \alpha (using Line search or another method).
    4. \theta = \theta + \alpha \vec{h}.
    5. Stop, if \vec{h} is small enough.
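
One common way to test positive definiteness in step (1) is to attempt a Cholesky factorization; a small sketch of the direction choice, assuming precomputed gradient and Hessian values:

import numpy as np

def hybrid_direction(grad_val, hess_val):
    # Newton direction if the Hessian is positive definite, steepest descent otherwise.
    try:
        np.linalg.cholesky(hess_val)                  # succeeds only for positive definite matrices
        return np.linalg.solve(hess_val, -grad_val)   # Newton step
    except np.linalg.LinAlgError:
        return -grad_val                              # gradient step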

4.3.2.2. Gauss-Newton algorithm

Let N be the number of samples and M the number of parameters. Define the vector of residual functions \mathbf{f}(\theta) := \left[ f_0(\theta), f_1(\theta), \dots, f_{N-1}(\theta) \right]^T. Furthermore, the N \times M Jacobian matrix is

\mathbf{J}(\theta) = \left[ \, \frac{\partial f_i(\theta) }{ \partial \theta_j} \, \right]_{ij}

The Gauss-Newton algorithm approximates the function \mathbf{f}(\theta+\vec{h}) by its first order (linear) Taylor expansion around \theta, \mathbf{l}(\vec{h}) := \mathbf{f}(\theta) + \mathbf{J}(\theta) \vec{h}. This is then used to formulate the minimization problem \text{argmin}_{\vec{h}} \tilde{F}(\theta + \vec{h}) := \mathbf{l}^T(\vec{h}) \mathbf{l}(\vec{h}), which is equivalent to solving the linear equation \left( \mathbf{J}(\theta)^T \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}(\theta)^T \mathbf{f}(\theta). As in Newton’s method, the rationale is that the minimum of the approximation can be computed in closed form. While Newton’s method linearizes the gradient \nabla, the Gauss-Newton algorithm linearizes the residual function \mathbf{f} directly.

  1. Init \theta = \theta_0.

  2. Iterate
    1. Solve \left( \mathbf{J}^T(\theta) \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{f}(\theta).
    2. Compute \alpha (using Line search or another method).
    3. \theta = \theta + \alpha \vec{h}.
    4. Stop, if \vec{h} is small enough.
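
A sketch of the iteration, assuming functions returning the residual vector f(theta) and the N x M Jacobian J(theta):

import numpy as np

def gauss_newton(f, jac, theta0, alpha=1.0, eps=1e-8, max_iter=100):
    # Solve (J^T J) h = -J^T f(theta) in each iteration.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        J = jac(theta)     # (N x M) Jacobian of the residuals
        r = f(theta)       # (N,)   residual vector
        h = np.linalg.solve(J.T @ J, -J.T @ r)
        theta = theta + alpha * h
        if np.linalg.norm(h) < eps:
            break
    return theta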

The algorithm converges if \mathbf{J}(\theta) has full rank in all iteration steps and if \theta_0 is close enough to \theta^*. This implies that the choice of the initial parameters influences not only the result (which local minimum is found) but also convergence. For \theta close to \theta^*, convergence is up to quadratic.

The algorithm can easily be extended to handle sample weights. Let \mathbf{W} be the diagonal matrix with the weights on its diagonal, \mathbf{W}_{ii} = w_i. Then, the linear equation to be solved in iteration step (1) becomes

\left( \mathbf{J}^T(\theta) \mathbf{W} \mathbf{J}(\theta) \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{W} \mathbf{f}(\theta)
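
In code, only the normal equations change; a sketch of the weighted step (weights given as a 1-D array):

import numpy as np

def weighted_gn_step(J, r, weights):
    # One weighted Gauss-Newton step: (J^T W J) h = -J^T W f
    W = np.diag(weights)
    return np.linalg.solve(J.T @ W @ J, -J.T @ W @ r)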

4.3.2.3. Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm introduces a damping parameter \mu \geq 0 to the Gauss-Newton algorithm. Using L(\vec{h}) := \frac{1}{2} \mathbf{l}(\vec{h})^T \mathbf{l}(\vec{h}), the minimization problem is reformulated as

\vec{h} = \text{argmin}_{\vec{h}} \left( L(\vec{h}) + \frac{1}{2} \mu \vec{h}^T \vec{h} \right)

and can be computed by solving the linear equation

\left( \mathbf{J}(\theta)^T \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}(\theta)^T  \mathbf{f}(\theta)

From the first equation, it can be seen that the damping parameter adds a penalty proportional to the squared length of the step \vec{h}. The step length can therefore be controlled by \mu, which makes the step size parameter \alpha obsolete. The damping parameter \mu can intuitively be described by the following two cases:

  1. \mu is large. The linear equation is approximately \vec{h} = -\frac{1}{\mu} \nabla(\theta), thus equivalent to the Steepest descent method with a small step size.

  2. \mu is small (zero in the extreme). The optimization problem becomes more similar to the original one. This is desired in the final stage, when \theta is close to \theta^*.

As in Line search, the damping parameter can be adjusted iteratively. The gain ratio is the ratio between the actual and the predicted decrease of F and is defined as

\varrho = \frac{ F(\theta) - F(\theta + \vec{h}) }{ L(\vec{0}) - L(\vec{h}) }

where L(\vec{h}) = \frac{1}{2} \mathbf{l}^T(\vec{h}) \mathbf{l}(\vec{h}) is the second order model of F from above. The denominator can be shown to equal \frac{1}{2} \vec{h}^T \left( \mu \vec{h} - \nabla \right). Using this, the damping parameter is updated in each step according to the following routine (due to Nielsen):

  1. If \varrho > 0
    1. \mu = \mu \max \left( \frac{1}{3}, 1-(2 \varrho - 1)^3 \right).
    2. \nu = 2
  2. Otherwise
    1. \mu = \mu \, \nu
    2. \nu = 2\nu

with \nu = 2 initially. The gain ratio is negative if the step \vec{h} does not actually decrease the target function F. If the gain ratio is not positive, as handled in case (2), the current estimate of \theta is not updated.
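
Nielsen’s rule as a small helper function (a sketch; mu is the damping parameter, nu its growth factor, rho the gain ratio):

def nielsen_update(mu, nu, rho):
    # Adjust the damping parameter according to the gain ratio.
    if rho > 0:
        # successful step: decrease the damping, reset the growth factor
        mu = mu * max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
        nu = 2.0
    else:
        # failed step: increase the damping, grow faster next time
        mu = mu * nu
        nu = 2.0 * nu
    return mu, nu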

The written-out Levenberg-Marquardt algorithm with adaptive damping parameter looks like this:

  1. Initialize \nu = 2

  2. Initialize \theta = \theta_0

  3. Initialize \mu = \tau \max\limits_i \left[ \mathbf{J}(\theta_0)^T \mathbf{J}(\theta_0) \right]_{ii}

  4. Iterate
    1. Solve \left( \mathbf{J}^T(\theta) \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{f}(\theta).
    2. Compute the gain ratio \varrho = \frac{ F(\theta) - F(\theta + \vec{h}) }{ L(\vec{0}) - L(\vec{h}) }.
    3. If the gain ratio is positive, compute \mu,\nu and update the parameter estimate \theta = \theta + \vec{h}
    4. Otherwise adjust \mu,\nu
    5. Stop, if the change in \theta or the gradient is small enough.

\tau and \theta_0 are parameters of the algorithm and therefore have to be provided by the user. The algorithm could also be formulated like the Gauss-Newton algorithm, i.e. with a constant damping parameter and a step size determined by line search. Yet, this would do little more than introduce an additional parameter and is therefore not beneficial.
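
Putting the pieces together, a compact sketch of the whole iteration (following the formulas of this section, not the ailib implementation; f and jac return the residual vector and the Jacobian):

import numpy as np

def levenberg_marquardt(f, jac, theta0, tau=1.0, eps0=1e-8, eps1=1e-8, max_iter=100):
    theta = np.asarray(theta0, dtype=float)
    r, J = f(theta), jac(theta)
    A, g = J.T @ J, J.T @ r
    mu, nu = tau * np.max(np.diag(A)), 2.0
    for _ in range(max_iter):
        if np.max(np.abs(g)) < eps0:          # gradient small enough
            break
        h = np.linalg.solve(A + mu * np.eye(A.shape[0]), -g)
        if np.linalg.norm(h) < eps1 * (np.linalg.norm(theta) + eps1):
            break                             # relative change in theta small enough
        r_new = f(theta + h)
        # gain ratio: actual vs. predicted decrease (denominator as derived above)
        rho = (r @ r - r_new @ r_new) / (0.5 * h @ (mu * h - g))
        if rho > 0:                           # accept the step, decrease damping
            theta, r = theta + h, r_new
            J = jac(theta)
            A, g = J.T @ J, J.T @ r
            mu, nu = mu * max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3), 2.0
        else:                                 # reject the step, increase damping
            mu, nu = mu * nu, 2.0 * nu
    return theta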

Like in the Gauss-Newton algorithm, the optimization problem can also incorporate sample weights:

\left( \mathbf{J}^T(\theta) \mathbf{W} \mathbf{J}(\theta) + \mu \mathbf{I} \right) \vec{h} = -\mathbf{J}^T(\theta) \mathbf{W} \mathbf{f}(\theta)

4.3.2.4. Interfaces

class ailib.fitting.NLSQ(model, derivs, initial=None, lsParams=None, sndDerivs=None)

Non-linear least-squares base class.

This is an incomplete class for nonlinear least-squares algorithms. The class provides some basic routines that are commonly used.

Subclasses are required to write the coeff member after fitting. Otherwise some routine calls may fail or produce wrong results. The default strategy for the step size is a constant.

Warning

Local minima!

Throughout the documentation of this class, the following type symbols are used:
  • a: The parameters of the model. ????????
  • b: The input of the model. May be arbitrary.
  • c: The output of the model. ?????????
  • N: Number of training samples.
  • M: Number of model function parameters.

The line search parameters are:
  • rho ?
  • beta ?
  • al_max ?
  • tau ?
  • eps ?

Warning

parameter dependence!

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • lsParams ([float]) – Line search parameters
  • sndDerivs ([[(Num c) :: a -> b -> c]]) – Second order partial derivatives of the model function. Expected is a matrix of functions where the element (i,j) represents the derivative w.r.t. first the i’th and then the j’th parameter. Only required for the Hessian matrix.
err((x, y))
Squared residual.
eval(x)
Evaluate the model at data point x with coefficients after fitting.
fit(features, labels=None, weights=None)

Fit the model to the training data.

Parameters:
  • features – Training samples.
  • labels – Target values.
  • weights ([float]) – Sample weights.
Returns:

self

grad(theta, data)

Compute the Gradient of the error function.

The error function is dependent on the model function parameters and the training data. This information has to be provided.

Todo

Cannot handle weights. Implemented but commented out (not tested, not quite sure)

Parameters:
  • theta (a) – The parameters of the model function.
  • data (STD) – The training samples.
Returns:

Gradient as a (Nx1) matrix

hessian(theta, data)

Compute the Hessian matrix of the error function.

In order to compute the Hessian, the first and second order partial derivatives have to be specified (use the constructor’s sndDerivs parameter). The error function is dependent on the model function parameters and the training data. This information has to be provided.

Todo

Cannot handle weights. Implemented but commented out (not tested, not quite sure)

Parameters:
  • theta (a) – The parameters of the model function.
  • data (STD) – The training samples.
Returns:

Hessian matrix (MxM)

jacobian(theta, data)

Compute the Jacobian matrix of the error function.

The error function is dependent on the model function parameters and the training data. This information has to be provided.

Todo

Cannot handle weights. Actually, as J_ij = [df<j>/dt<i>], weights don’t appear. In the context of Gauss-Newton and Levenberg-Marquardt, J*W is used.

Parameters:
  • theta (a) – The parameters of the model function
  • data (STD) – The training samples.
Returns:

Jacobian matrix (NxM)

class ailib.fitting.SteepestDescent(model, derivs, initial=None, eps=0.001, maxIter=100, lsParams=None)

Bases: ailib.fitting.nlsq.NLSQ

Implementation of the Steepest descent algorithm.

The initial parameter values and convergence threshold have to be set either at object construction or at the call to the fit method. The latter overwrites the former.

The default strategy for step size computation is the Soft line search. This behaviour can be modified through the alpha member. Example:

>>> o = SteepestDescent(...)
>>> o.alpha = o._exactLineSearch

If the fitting fails due to an OverflowError, try modifying the line search parameters. Consider setting al_max to a relatively small value (<< 1.0), although this will increase convergence time.

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
  • maxIter (int) – Maximum number of iterations.
  • lsParams ([float]) – Line search parameters.
fit(samples, labels, weights=None, theta=None, eps=None)

Determine the parameters of a model, s.t. it optimally fits the training data.

The parameters theta and eps are optional. If one of them is not set, its default value determined at object construction will be used (see the constructor’s initial and eps arguments).

Parameters:
  • samples ([]) – Training data.
  • labels ([]) – Training labels.
  • weights ([float]) – Sample weights.
  • theta (a) – Initial parameter choice.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
Returns:

self

class ailib.fitting.Newton(model, derivs, sndDerivs, initial=None, eps=None, maxIter=100, lsParams=None)

Bases: ailib.fitting.nlsq.NLSQ

Implementation of Newton’s method.

Parameter handling and step size strategy behave as in SteepestDescent.

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • sndDerivs ([[(Num c) :: a -> b -> c]]) – Second order partial derivatives of the model function.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
  • maxIter (int) – Maximum number of iterations.
  • lsParams ([float]) – Line search parameters
fit(samples, labels, weights=None, theta=None, eps=None)

Determine the parameters of a model, s.t. it optimally fits the training data.

The parameters theta and eps are optional. If one of them is not set, its default value determined at object construction will be used (see the constructor’s initial and eps arguments).

Parameters:
  • samples ([]) – Training data.
  • labels ([]) – Training labels.
  • weights ([float]) – Sample weights.
  • theta (a) – Initial parameter choice.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
Returns:

self

class ailib.fitting.HybridSDN(model, derivs, sndDerivs, initial=None, eps=None, maxIter=100, lsDescent=None, lsNewton=None)

Bases: ailib.fitting.nlsq.Newton

Implementation of the Hybrid SD/Newton algorithm.

The Newton step is chosen, if the Hessian matrix is positive definite. Otherwise, the steepest descent method is applied.

Parameter handling and step size strategy behave as in SteepestDescent.

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • sndDerivs ([[(Num c) :: a -> b -> c]]) – Second order partial derivatives of the model function.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
  • maxIter (int) – Maximum number of iterations.
  • lsDescent ([float]) – Line search parameters for the steepest descent part.
  • lsNewton ([float]) – Line search parameters for the Newton part.
fit(samples, labels, weights=None, theta=None, eps=None)

Determine the parameters of a model, s.t. it optimally fits the training data.

The parameters theta and eps are optional. If one of them is not set, its default value determined at object construction will be used (see the constructor’s initial and eps arguments).

Parameters:
  • samples ([]) – Training data.
  • labels ([]) – Training labels.
  • weights ([float]) – Sample weights.
  • theta (a) – Initial parameter choice.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
Returns:

self

class ailib.fitting.GaussNewton(model, derivs, initial=None, eps=None, maxIter=100, lsParams=None)

Bases: ailib.fitting.nlsq.NLSQ

Implementation of the Gauss-Newton algorithm.

Parameter handling and step size strategy behave as in SteepestDescent.

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
  • maxIter (int) – Maximum number of iterations.
  • lsParams ([float]) – Line search parameters
fit(samples, labels, weights=None, theta=None, eps=None)

Determine the parameters of a model, s.t. it optimally fits the training data.

The parameters theta and eps are optional. If one of them is not set, its default value determined at object construction will be used (see the constructor’s initial and eps arguments).

Parameters:
  • samples ([]) – Training data.
  • labels ([]) – Training labels.
  • weights ([float]) – Sample weights.
  • theta (a) – Initial parameter choice.
  • eps (float) – Convergence threshold. The iteration stops, if the step length is below this threshold.
Returns:

self

class ailib.fitting.Marquardt(model, derivs, initial=None, eps0=0.01, eps1=0.01, tau=1.0, maxIter=100)

Bases: ailib.fitting.nlsq.NLSQ

Implementation of the Levenberg-Marquardt algorithm.

The damping parameter mu is initialized using tau (as described above) and adjusted according to Nielsen’s strategy.

Parameters:
  • model ((Num c) :: a -> b -> c) – Model function to be fitted.
  • derivs ([(Num c) :: a -> b -> c]) – First order derivatives of the model function. Expected is a list where the i’th element represents the derivative w.r.t. the i’th parameter.
  • initial (c) – Initial choice of parameters. The initial choice may influence result and convergence speed.
  • eps0 (float) – Convergence threshold. The iteration stops, if the maximal gradient component is below this threshold.
  • eps1 (float) – Convergence threshold. The iteration stops, if the relative change in the model parameters is below this threshold.
  • tau (float) – Constant for determining the initial damping. Use a small value (e.g. 10**(-6)) if the initial parameters are believed to be a good approximation; otherwise use 10**(-3) or 1.0 (default).
  • maxIter (int) – Maximum number of iterations.
fit(samples, labels, weights=None, theta=None, eps0=None, eps1=None, tau=None)

Determine the parameters of a model, s.t. it optimally fits the training data.

The parameters theta, eps0, eps1 and tau are optional. If one of them is not set, its default value determined at object construction will be used (see the constructor’s initial, eps0, eps1 and tau arguments).

Parameters:
  • samples ([]) – Training data.
  • labels ([]) – Training labels.
  • weights ([float]) – Sample weights.
  • theta (a) – Initial parameter choice.
  • eps0 (float) – Convergence threshold. The iteration stops, if the maximal gradient component is below this threshold.
  • eps1 (float) – Convergence threshold. The iteration stops, if the relative change in the model parameters is below this threshold.
  • tau (float) – Constant for determining the initial damping. Use a small value (e.g. 10**(-6)) if the initial parameters are believed to be a good approximation; otherwise use 10**(-3) or 1.0 (default).
Returns:

self
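
A hypothetical usage sketch for the example model from the beginning of this section. The calling convention model(theta, x) and derivs as a list of per-parameter functions is assumed from the type signatures above; the data values are made up for illustration:

import math
from ailib.fitting import Marquardt

# M_theta(x) = theta0 * exp(theta1 * x) + theta2 * exp(theta3 * x)
model = lambda t, x: t[0] * math.exp(t[1] * x) + t[2] * math.exp(t[3] * x)

# one first order derivative per parameter, same (theta, x) signature as the model
derivs = [
    lambda t, x: math.exp(t[1] * x),             # dM/dtheta0
    lambda t, x: t[0] * x * math.exp(t[1] * x),  # dM/dtheta1
    lambda t, x: math.exp(t[3] * x),             # dM/dtheta2
    lambda t, x: t[2] * x * math.exp(t[3] * x),  # dM/dtheta3
]

# hypothetical training data
samples = [0.0, 0.5, 1.0, 1.5, 2.0]
labels = [2.0, 1.2, 0.8, 0.55, 0.4]

est = Marquardt(model, derivs, initial=[1.0, -1.0, 1.0, -1.0]).fit(samples, labels)
print(est.coeff)  # fitted parameters (the coeff member is assumed to hold them after fitting)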