# Linear, Polynomial, and Multiple Regression via the Pseudoinverse

Supervised learning consists of fitting functions to data sets. The idea is that we have a sample of inputs and outputs from some process, and we want to come up with a mathematical model that predicts what the outputs would be for other inputs. Supervised machine learning involves programming computers to compute such models.

One of the simplest ways to construct a predictive algorithm is to assume a linear relationship between the input and output. Even if this assumption isn’t correct, it’s always useful to start with a linear model because it provides a baseline against which we can compare the accuracy of more complicated models. (We can only justify using a complicated model if it is significantly more accurate than a linear model.)

For example, let’s fit a linear model $y=mx + b$ to the following data set. That is, we want to find the values of $m$ and $b$ so that the line $y=mx+b$ most accurately represents the following data set.

\begin{align*} \left[ (0,1), (2,5), (4,3) \right] \end{align*}

While there is no single line that goes through all three points in the data set, we can choose the slope $m$ and intercept $b$ so that the line represents the data as accurately as possible. There are a handful of ways to accomplish this, the simplest of which involves leveraging the pseudoinverse from linear algebra.

To start, let’s set up the system of equations that would need to be satisfied in order for our model to have perfect accuracy on the data set. We do this by simply taking each point $(x,y)$ in our data set and plugging it into the model $y=mx+b.$

\begin{align*} (x,y) \quad &\to \quad y = m \cdot x + b \\ \\ (0,1) \quad &\to \quad 1 = m \cdot 0 + b \\ (2,5) \quad &\to \quad 5 = m \cdot 2 + b \\ (4,3) \quad &\to \quad 3 = m \cdot 4 + b \end{align*}

Now, we rewrite the above system using matrix notation:

\begin{align*} \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix} = \begin{bmatrix} m \cdot 0 + b \\ m \cdot 2 + b \\ m \cdot 4 + b \end{bmatrix} \quad \to \quad \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 2 & 1 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \end{align*}

Based on muscle memory, you may be tempted to solve the matrix equation by inverting the coefficient matrix that’s multiplying the unknown:

\begin{align*} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 2 & 1 \\ 4 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix} \end{align*}

Then you might remember that the inverse of a non-square matrix does not exist, and think that this was all fruitless.

But really, this is the main idea of the pseudoinverse method. The only difference is that before inverting the coefficient matrix, we multiply both sides of the equation by the transpose of the coefficient matrix so that it becomes square.

\begin{align*} \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix} &= \begin{bmatrix} 0 & 1 \\ 2 & 1 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \\[5pt] \begin{bmatrix} 0 & 2 & 4 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix} &= \begin{bmatrix} 0 & 2 & 4 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 1 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \\[5pt] \begin{bmatrix} 22 \\ 9 \end{bmatrix} &= \begin{bmatrix} 20 & 6 \\ 6 & 3 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \end{align*}

Geometrically, multiplying by the transpose projects the output vector onto the column space of the coefficient matrix, which is the nearest subspace in which a solution exists. The resulting square system can now be inverted:

\begin{align*} \begin{bmatrix} m \\ b \end{bmatrix} &= \begin{bmatrix} 20 & 6 \\ 6 & 3 \end{bmatrix}^{-1} \begin{bmatrix} 22 \\ 9 \end{bmatrix} \\[5pt] &= \dfrac{1}{24} \begin{bmatrix} 3 & -6 \\ -6 & 20 \end{bmatrix} \begin{bmatrix} 22 \\ 9 \end{bmatrix} \\[5pt] &= \dfrac{1}{24} \begin{bmatrix} 12 \\ 48 \end{bmatrix} \\[5pt] &= \begin{bmatrix} 1/2 \\ 2 \end{bmatrix} \end{align*}
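This computation is easy to check numerically. Here is a minimal sketch, assuming NumPy is available (any linear algebra library would do):

```python
import numpy as np

# Data set [(0, 1), (2, 5), (4, 3)] as a coefficient matrix and output vector.
X = np.array([[0.0, 1.0],
              [2.0, 1.0],
              [4.0, 1.0]])  # each row is [x, 1]
y = np.array([1.0, 5.0, 3.0])

# Multiply both sides by X^T, then invert the resulting square matrix.
p = np.linalg.inv(X.T @ X) @ (X.T @ y)
# p holds [m, b], approximately [0.5, 2.0]
```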

Reading off the parameters $m = \dfrac{1}{2}$ and $b=2,$ we have the following linear model:

\begin{align*} y = \dfrac{1}{2} x + 2 \end{align*}

We can use the pseudoinverse method to fit polynomial models as well. To illustrate, let’s fit a quadratic model $y = ax^2 + bx + c$ to the following data set:

\begin{align*} \left[ (0,1), (2,5), (4,3), (5,0) \right] \end{align*}

First, we set up our system of equations:

\begin{align*} (x,y) \quad &\to \quad y = a \cdot x^2 + b \cdot x + c \\ \\ (0,1) \quad &\to \quad 1 = a \cdot 0^2 + b \cdot 0 + c \\ (2,5) \quad &\to \quad 5 = a \cdot 2^2 + b \cdot 2 + c \\ (4,3) \quad &\to \quad 3 = a \cdot 4^2 + b \cdot 4 + c \\ (5,0) \quad &\to \quad 0 = a \cdot 5^2 + b \cdot 5 + c \\ \end{align*}

Then we convert to a matrix equation, multiply both sides by the transpose of the coefficient matrix, and invert the square matrix that results.

\begin{align*} \begin{bmatrix} 1 \\ 5 \\ 3 \\ 0 \end{bmatrix} &= \begin{bmatrix} 0^2 & 0 & 1 \\ 2^2 & 2 & 1 \\ 4^2 & 4 & 1 \\ 5^2 & 5 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} 0^2 & 2^2 & 4^2 & 5^2 \\ 0 & 2 & 4 & 5 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \\ 3 \\ 0 \end{bmatrix} &= \begin{bmatrix} 0^2 & 2^2 & 4^2 & 5^2 \\ 0 & 2 & 4 & 5 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0^2 & 0 & 1 \\ 2^2 & 2 & 1 \\ 4^2 & 4 & 1 \\ 5^2 & 5 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} 68 \\ 22 \\ 9 \end{bmatrix} &= \begin{bmatrix} 897 & 197 & 45 \\ 197 & 45 & 11 \\ 45 & 11 & 4 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} a \\ b \\ c \end{bmatrix} &= \begin{bmatrix} 897 & 197 & 45 \\ 197 & 45 & 11 \\ 45 & 11 & 4 \end{bmatrix}^{-1} \begin{bmatrix} 68 \\ 22 \\ 9 \end{bmatrix} \\[5pt] &\approx \begin{bmatrix} -0.73 \\ 3.42 \\ 1.02 \end{bmatrix} \end{align*}

Note that we used a computer to simplify the final step. You can start to see why computers are essential in machine learning (and this is just the tip of the iceberg).
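For instance, a short NumPy sketch (NumPy being an assumed tool here, not something the method requires) reproduces the entire computation:

```python
import numpy as np

# Quadratic fit to [(0, 1), (2, 5), (4, 3), (5, 0)]: each row of X is [x^2, x, 1].
x = np.array([0.0, 2.0, 4.0, 5.0])
y = np.array([1.0, 5.0, 3.0, 0.0])
X = np.column_stack([x**2, x, np.ones_like(x)])

# Normal equations: multiply both sides by X^T, then invert the square matrix.
p = np.linalg.inv(X.T @ X) @ (X.T @ y)
# p holds [a, b, c], approximately [-0.73, 3.42, 1.02]
```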

We have the following quadratic model:

\begin{align*} y \approx -0.73x^2 + 3.42 x + 1.02 \end{align*}

Lastly, the pseudoinverse method can also be used to fit models consisting of multiple input variables. For example, let’s fit a linear model $z = ax + by + c$ to the following data set:

\begin{align*} \left[ (0,1,50), (2,5,30), (4,3,20), (5,0,10) \right] \end{align*}

Here, we have two input variables, $x$ and $y.$ Our output variable is $z.$

As usual, we begin by setting up our system of equations:

\begin{align*} (x,y,z) \quad &\to \quad z = a \cdot x + b \cdot y + c \\ \\ (0,1,50) \quad &\to \quad 50 = a \cdot 0 + b \cdot 1 + c \\ (2,5,30) \quad &\to \quad 30 = a \cdot 2 + b \cdot 5 + c \\ (4,3,20) \quad &\to \quad 20 = a \cdot 4 + b \cdot 3 + c \\ (5,0,10) \quad &\to \quad 10 = a \cdot 5 + b \cdot 0 + c \\ \end{align*}

Then we convert to a matrix equation, multiply both sides by the transpose of the coefficient matrix, and invert the square matrix that results.

\begin{align*} \begin{bmatrix} 50 \\ 30 \\ 20 \\ 10 \end{bmatrix} &= \begin{bmatrix} 0 & 1 & 1 \\ 2 & 5 & 1 \\ 4 & 3 & 1 \\ 5 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} 0 & 2 & 4 & 5 \\ 1 & 5 & 3 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 50 \\ 30 \\ 20 \\ 10 \end{bmatrix} &= \begin{bmatrix} 0 & 2 & 4 & 5 \\ 1 & 5 & 3 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 & 1 \\ 2 & 5 & 1 \\ 4 & 3 & 1 \\ 5 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} 190 \\ 260 \\ 110 \end{bmatrix} &= \begin{bmatrix} 45 & 22 & 11 \\ 22 & 35 & 9 \\ 11 & 9 & 4 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\[5pt] \begin{bmatrix} a \\ b \\ c \end{bmatrix} &= \begin{bmatrix} 45 & 22 & 11 \\ 22 & 35 & 9 \\ 11 & 9 & 4 \end{bmatrix}^{-1} \begin{bmatrix} 190 \\ 260 \\ 110 \end{bmatrix} \\[5pt] &\approx \begin{bmatrix} -7.74 \\ -0.60 \\ 50.12 \end{bmatrix} \end{align*}
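As before, the computation can be reproduced with a short sketch (again assuming NumPy); the only change from the single-variable case is that each row of the coefficient matrix now holds both inputs:

```python
import numpy as np

# Plane fit to [(0, 1, 50), (2, 5, 30), (4, 3, 20), (5, 0, 10)]: rows of X are [x, y, 1].
X = np.array([[0.0, 1.0, 1.0],
              [2.0, 5.0, 1.0],
              [4.0, 3.0, 1.0],
              [5.0, 0.0, 1.0]])
z = np.array([50.0, 30.0, 20.0, 10.0])

p = np.linalg.inv(X.T @ X) @ (X.T @ z)
# p holds [a, b, c], approximately [-7.74, -0.60, 50.12]
```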

So, we have the following model which represents a plane in $3$-dimensional space:

\begin{align*} z \approx -7.74 x - 0.60 y + 50.12 \end{align*}

Note that the pseudoinverse method requires the columns of the coefficient matrix to be independent. Otherwise, when we multiply by the transpose, the result is not guaranteed to be invertible. In particular, this means that the number of parameters of the model that we want to fit (i.e. the width of the coefficient matrix) should not exceed the number of distinct data points in our data set (i.e. the height of the coefficient matrix).

To illustrate what happens when the number of parameters of the model exceeds the number of distinct data points, let’s try to fit a line $y=mx+b$ to a data set $[(2,3)]$ consisting of only a single point. (The line has $2$ parameters, $m$ and $b.$)

\begin{align*} 3 &= m \cdot 2 + b \\[5pt] \begin{bmatrix} 3 \end{bmatrix} &= \begin{bmatrix} 2 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \\[5pt] \begin{bmatrix} 2 \\ 1 \end{bmatrix} \begin{bmatrix} 3 \end{bmatrix} &= \begin{bmatrix} 2 \\ 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \\[5pt] \begin{bmatrix} 6 \\ 3 \end{bmatrix} &= \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \\[5pt] \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}^{-1} & \textrm{does not exist} \end{align*}

Although multiplying by the transpose gives us a square matrix, that square matrix is not invertible. This happens because there are infinitely many lines that pass through the point $(2,3)$ and therefore have perfect accuracy on the data set.
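We can see this failure numerically. In the sketch below (assuming NumPy), the product $X^T X$ has rank $1,$ and attempting to invert it raises an error:

```python
import numpy as np

# With the single data point (2, 3), X has one row, so X^T X is 2x2 but rank 1.
X = np.array([[2.0, 1.0]])
A = X.T @ X  # [[4, 2], [2, 1]], whose determinant is 4*1 - 2*2 = 0

rank = np.linalg.matrix_rank(A)  # 1: the columns are dependent
singular = False
try:
    np.linalg.inv(A)  # a singular matrix has no inverse
except np.linalg.LinAlgError:
    singular = True  # the pseudoinverse method fails here
```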

Looking back at our work, we can write down the general procedure as follows, where $y$ is the vector of outputs, $X$ is the coefficient matrix of our system of equations, and $p$ is the vector of model parameters.

\begin{align*} y &= X p \\ X^T y &= X^T X p \\ p &= (X^T X)^{-1} X^T y \end{align*}
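The general procedure amounts to a one-line function. As a sketch (assuming NumPy), we can also check it against NumPy's built-in `np.linalg.pinv`, which computes the same pseudoinverse whenever the columns of $X$ are independent:

```python
import numpy as np

def fit_parameters(X, y):
    """Solve the normal equations: p = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

# Check on the linear example from earlier: [(0, 1), (2, 5), (4, 3)].
X = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
y = np.array([1.0, 5.0, 3.0])
p = fit_parameters(X, y)

# NumPy's built-in pseudoinverse gives the same parameters.
assert np.allclose(p, np.linalg.pinv(X) @ y)
```

In practice, libraries compute the pseudoinverse through more numerically stable routes (such as the singular value decomposition) rather than by inverting $X^T X$ directly, but the result is the same when the columns are independent.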

The pseudoinverse of the matrix $X$ is defined as $(X^T X)^{-1} X^T,$ as shown in the last row of the general equation. To understand why this quantity is called the pseudoinverse, first recall that if we could solve the equation $y = Xp$ using the regular inverse, we would get a solution of the form

\begin{align*} p = X^{-1} y. \end{align*}

The issue that we run into is that the regular inverse $X^{-1}$ usually does not exist (because $X$ is usually a tall rectangular matrix, not a square matrix). The best approximation of the solution is

\begin{align*} p = (X^T X)^{-1} X^T y, \end{align*}

which takes a similar form to the previous equation, except that the inverse $X^{-1}$ is replaced by the pseudoinverse $(X^T X)^{-1} X^T.$

Finally, note that quantitative models are usually referred to as regression models when the goal is to predict some numeric value. This is contrasted with classification models, where the goal is to predict the category or class that best represents an input. So far, we have only encountered regression models, but we will learn about classification models soon.

## Practice Problems

Use the pseudoinverse method to fit the given model to the given data set. Check your answer by sketching the resulting model on a graph containing the data points and verifying that it visually appears to capture the trend of the data. Remember that the model does not need to actually pass through all the data points (this is usually impossible).

1. Fit $y = mx + b$ to $[(1,0), (3,-1), (4,5)].$
2. Fit $y = mx + b$ to $[(-2,3), (1,0), (3,-1), (4,5)].$
3. Fit $y = ax^2 + bx + c$ to $[(-2,3), (1,0), (3,-1), (4,5)].$
4. Fit $y = ax^2 + bx + c$ to $[(-3,-4), (-2,3), (1,0), (3,-1), (4,5)].$
5. ($\ast$) Fit $y = ax^3 + bx^2 + cx + d$ to $[(-3,-4), (-2,3), (1,0), (3,-1), (4,5)].$
6. Fit $z = ax + by + c$ to $[(-2,3,-3), (1,0,-4), (3,-1,2), (4,5,3)].$
