4.3.1. Linear Regression

Generalized linear models

Given an input vector \vec{x} = \left[ x_1, x_2, \dots, x_p \right]^T and coefficients \vec{\beta} = \left[ \beta_0, \beta_1, \beta_2, \dots, \beta_p \right]^T, a generalized linear model is a function of the form:

f(\vec{x}) = \beta_0 + \sum\limits_{j=1}^p x_j \beta_j

This notation describes any function with linear dependence on the parameters \beta_j. For example, a polynomial can be written as

f(\vec{x}) = \beta_0 + \sum\limits_{j=1}^p x_j \beta_j \quad \text{with } \vec{x} = \left[ x, x^2, \dots, x^p \right]
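For instance, a small NumPy sketch (not part of ailib) of this feature map x \mapsto [x, x^2, \dots, x^p]: evaluating the polynomial becomes a plain inner product with \vec{\beta}, which is exactly what makes the model linear in its parameters.

```python
import numpy as np

def poly_features(x, p):
    """Map a scalar x to the basis vector [x, x**2, ..., x**p]."""
    return np.array([x**j for j in range(1, p + 1)])

# f(x) = beta_0 + sum_j x_j * beta_j with x_j = x**j
beta0 = 1.0
beta = np.array([2.0, -1.0, 0.5])            # beta_1 .. beta_3
f = lambda x: beta0 + poly_features(x, 3) @ beta

# The cubic 1 + 2x - x**2 + 0.5*x**3 evaluated at x = 2
print(f(2.0))                                # -> 5.0
```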

The fitting problem focuses on finding the optimal parameters \beta_j of the function f(\vec{x}) for N given training pairs (\vec{x}_i, \, y_i), with y_i = f(\vec{x}_i) being the target output. In this context, optimal means that the allocation of \vec{\beta} needs to minimize the sum of squared residuals:

\text{argmin}_\beta \, \text{RSS}(\beta) = \text{argmin}_\beta \, \sum\limits_{i=1}^N w_i \left( y_i - f(\vec{x}_i) \right)^2 = \text{argmin}_\beta \, \left( \vec{y} - \mathbf{X} \vec{\beta} \right)^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right)

The second form is the equivalent matrix notation, which avoids the sum. The matrix \mathbf{X} (of size N \times (p+1)) holds the training input vectors as its rows, and \mathbf{W} is a diagonal matrix with W_{ii} = w_i.

As known from basic calculus, extremal points (here the minimum) can be found by setting the derivative of the function to zero. In this problem setting, the training samples are fixed and the optimization is over the parameters, so we are interested in the derivative with respect to the parameters \beta:

\frac{ \partial \text{RSS}(\beta) }{ \partial \beta } = 0

The nice property of generalized linear models is that this derivative can easily be determined:

\frac{ \partial \text{RSS}(\beta) }{ \partial \beta } = -2 \mathbf{X}^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right)

Putting the above ideas together yields the closed-form solution of the fitting problem:

\frac{ \partial \text{RSS}(\beta) }{ \partial \beta } = 0
-2 \mathbf{X}^T \mathbf{W} \left( \vec{y} - \mathbf{X} \vec{\beta} \right) = 0
\mathbf{X}^T \mathbf{W} \vec{y} - \mathbf{X}^T \mathbf{W} \mathbf{X} \vec{\beta} = 0
\mathbf{X}^T \mathbf{W} \vec{y} = \mathbf{X}^T \mathbf{W} \mathbf{X} \vec{\beta}
\vec{\beta} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \vec{y}

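This closed form translates directly into code. A minimal NumPy sketch (not ailib code; np.linalg.solve is used on the normal equations instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def wls_fit(X, y, w=None):
    """Weighted least squares: solve (X^T W X) beta = X^T W y.

    X is the N x (p+1) design matrix (first column all ones),
    y holds the N targets, w the optional sample weights.
    """
    if w is None:
        w = np.ones(len(y))
    XtW = X.T * w                      # same as X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Recover the line y = 1 + 2x from noise-free samples
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
print(wls_fit(X, y))                   # -> approximately [1. 2.]
```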
The generalization to K outputs is achieved by writing \vec{y} as an (N \times K) matrix with one set of output values per row. The coefficients are then computed the same way as above, with \vec{\beta} becoming a ((p+1) \times K) matrix. Note that the outputs are independent of each other, so solving the system with multiple outputs is equivalent to solving it once for each single output. [ESLII]

Ridge regression

The least squares routine described above may lead to severe overfitting. The generalization error might even decrease if some of the coefficients were set to zero.

Instead of removing a subset of the coefficients, ridge regression introduces a penalty on the size of the coefficients. The idea is that the sum of squared coefficients may not exceed a threshold. This forces less significant coefficients towards zero, while the most influential ones are not changed too heavily. Using Lagrange multipliers, the optimization problem is stated the following way:

\hat{\beta}^{\text{ridge}} = \text{argmin}_\beta \, \left( \sum\limits_{i=1}^N w_i \left( y_i - f_\beta(\vec{x}_i) \right)^2 + \lambda \sum\limits_{j=1}^p \beta_j^2 \right)

The parameter \lambda \geq 0 is introduced to control the amount of shrinkage. The larger \lambda, the more the coefficients are constrained. For \lambda=0, the problem becomes identical to the ordinary least squares.
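The shrinkage behaviour can be illustrated with a small NumPy sketch (independent of ailib; it uses the standard ridge closed form, and for brevity penalizes all coefficients even though in practice the intercept is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution (X^T X + lam*I)^{-1} X^T y, all columns penalized."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Larger lambda constrains the coefficients more: their norm shrinks.
# lambda = 0 reproduces the ordinary least squares solution.
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```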

As for ordinary least squares, the sum of squared residuals can be formulated in matrix notation

\text{RSS}(\lambda) = (\vec{y} - \mathbf{X} \vec{\beta})^T \mathbf{W} (\vec{y} - \mathbf{X} \vec{\beta}) + \lambda \vec{\beta}^T \vec{\beta}

and by setting its derivative with respect to \beta to zero, the closed-form solution is found:

\hat{\beta}^{\text{ridge}} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{W} \vec{y}

Interfaces

class ailib.fitting.LinearModel(lambda_=0.0)

Generalized linear model base class.

Although possible, it’s generally not a good idea to use this class directly. Instead, use a derived class such as BasisRegression.

Subclasses are required to write the coeff member after fitting. Otherwise some routine calls may fail or produce wrong results.

Parameter: lambda_ (float) – Shrinkage factor.
err(x, y)

Squared residual.

Evaluate the model at data point x.

The last computed coefficients are used.


This routine doesn’t work if x is a scipy.matrix.

Parameter: x ([float]) – (p+1) basis inputs.
Returns: Model output.
fit(data, labels, weights=None)

Fit the model to some training data.

  • data ([a]) – Training inputs.
  • labels ([b]) – Training labels.
  • weights ([float]) – Sample weights.


Number of basis values.
Number of output values.
class ailib.fitting.BasisRegression(lambda_=0.0)

Least squares regression with arbitrary basis input.

This class allows arbitrary basis functions to be used. The resulting model is

f(x) = c0*b0(x) + c1*b1(x) + c2*b2(x) + c3*b3(x) + ...

with b0(x) := 1.0

The basis functions can be registered using the addBasis method. Any function can be used; the only requirement is that the type of its only argument is the same as the type of the model input x (see below).

Parameter: lambda_ (float) – Shrinkage factor.

addBasis(b)

Add a basis function.

Parameter: b ((Num b) :: a -> b) – Basis function.
Evaluate the model at data point x.
fit(data, labels, weights=None)

Fit the model to some training data.

  • data (a) – Training inputs.
  • labels (b) – Training labels.
  • weights ([float]) – Sample weights.
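The underlying mechanics can be sketched independently of ailib (assuming NumPy): each registered basis function contributes one column to the design matrix, after which the ordinary least-squares machinery applies unchanged.

```python
import numpy as np

def basis_fit(bases, xs, ys):
    """Least-squares fit of f(x) = c0*1.0 + c1*b1(x) + c2*b2(x) + ...

    `bases` is a list of callables b1, b2, ...; the constant basis
    b0(x) := 1.0 is added implicitly, as in the model above.
    """
    B = np.column_stack(
        [np.ones(len(xs))] + [[b(x) for x in xs] for b in bases]
    )
    coeff, _, _, _ = np.linalg.lstsq(B, np.asarray(ys), rcond=None)
    return coeff

# Recover y = 2 + 3*sin(x) using a single sine basis function
xs = np.linspace(0.0, 6.0, 50)
ys = 2.0 + 3.0 * np.sin(xs)
print(basis_fit([np.sin], xs, ys))     # -> approximately [2. 3.]
```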


class ailib.fitting.PolynomialRegression(deg, lambda_=0.0)

Polynomial regression.

For fitting a polynomial f(x) = c0 + c1*x + c2*x**2 + c3*x**3 + ...

  • deg (int) – Degree of the polynomial (largest exponent).
  • lambda_ (float) – Shrinkage factor.

Examples

Let’s define a toy function that generates some samples and fits a polynomial to them.

def polyEst(lo, hi, numSamples, sigma, deg, l=0.0):

    # Setting up the true polynomial and the estimator
    poly = lambda x: 1.0 + 2.0*x + 3.0*x**2 + 0.5*x**3 + 0.25*x**4 + 0.125*x**5
    est = ailib.fitting.PolynomialRegression(deg, lambda_=l)

    # Generate noisy samples from the polynomial
    x = [lo + (hi - lo) * random.random() for i in range(numSamples)]
    y = [poly(i) + random.gauss(0.0, sigma) for i in x]

    # Fit the estimator to the samples
    est.fit(x, y)
    return est, (x, y)

>>> est, samples = polyEst(0.0, 1.0, 1000, 1.0, 5)
>>> print est
Poly(1.06417448201 * x**0 + 1.90233085087 * x**1 + -1.80439247299 * x**2 + 25.2719255838 * x**3 + -38.4765407136 * x**4 + 19.0506853189 * x**5)

As an aside, some models that are nonlinear in their parameters can be brought into this linear framework by a transformation. For example, y = exp(a0*x0) * exp(a1*x1) * exp(a2*x2) becomes linear in the coefficients after taking the logarithm: ln(y) = a0*x0 + a1*x1 + a2*x2.
