Multiple Regression and Interaction Terms

by Justin Skycak on January 06, 2022

In many real-life situations, there is more than one input variable that controls the output variable.

This post is part of the book Introduction to Algorithms and Machine Learning: from Sorting to Strategic Agents. Suggested citation: Skycak, J. (2022). Multiple Regression and Interaction Terms. In Introduction to Algorithms and Machine Learning: from Sorting to Strategic Agents. https://justinmath.com/multiple-regression-and-interaction-terms/

In many real-life situations, there is more than one factor that controls the quantity we’re trying to predict. That is to say, there is more than one input variable that controls the output variable.

Example: Multiple Input Variables

For example, suppose that a food manufacturing company is testing out different ingredients on sandwiches, including peanut butter, jelly, and roast beef. They fed sandwiches to subjects and recorded the proportion of subjects who liked each sandwich.

We want to build a model that has $3$ input variables:

$\begin{align*} x_1 &= \textrm{scoops peanut butter} \\ x_2 &= \textrm{scoops jelly} \\ x_3 &= \textrm{slices beef} \end{align*}$

The model will predict $1$ output variable:

$\begin{align*} y &= \textrm{proportion subjects liked} \end{align*}$

Since this output variable must be between $0$ and $1,$ we will use logistic regression.

$\begin{align*} y &= \dfrac{1}{1 + e^{-(ax+b)}} \end{align*}$

The logistic model above is written with only a single input variable. Here, we have $3$ different input variables, so we will introduce a new term for each input variable:

$\begin{align*} y &= \dfrac{1}{1 + e^{-(a_1 x_1 + a_2 x_2 + a_3 x_3 +b)}} \end{align*}$
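As a minimal sketch, the multi-input logistic model can be written as a short Python function. The function name `predict` and the placeholder parameter values are assumptions made for illustration; they are not fitted values.

```python
import math

def predict(x, a, b):
    """Multi-input logistic model: 1 / (1 + e^{-(a_1 x_1 + ... + a_n x_n + b)})."""
    z = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
    return 1 / (1 + math.exp(-z))

# Placeholder parameters, chosen only to illustrate the call (not fitted values).
a = [1.0, 1.0, 1.0]
b = 0.0

# With all inputs equal to 0 and b = 0, the exponent is 0, so the output is 1/2.
print(predict([0, 0, 0], a, b))
```

Each input variable contributes its own term to the exponent, and the output always stays strictly between $0$ and $1.$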

We should also introduce terms that represent interactions between the variables, but to keep things simple and illustrate why such terms are needed, let’s continue without them.

If we fit the above model to our data set by running gradient descent a handful of times with different initial guesses and choosing the best result, we get the following fitted model:

$\begin{align*} y &= \dfrac{1}{1 + e^{-(0.79 x_1 + 1.13 x_2 + 0.75 x_3 - 1.72)}} \end{align*}$
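The fitting procedure might be sketched as follows. This is only one possible setup, not necessarily the one used to produce the coefficients above: the loss function (sum of squared errors), learning rate, number of steps, number of initial guesses, and random seed are all assumptions. The data set is the one listed in the table of observed proportions in this post.

```python
import math
import random

# (x1, x2, x3) -> observed proportion of subjects who liked the sandwich
data = [
    ((0, 0, 0), 0.0), ((1, 0, 0), 0.2), ((2, 0, 0), 0.5),
    ((0, 1, 0), 0.4), ((0, 2, 0), 0.6), ((0, 0, 1), 0.5),
    ((0, 0, 2), 0.8), ((1, 1, 0), 1.0), ((1, 0, 1), 0.0),
    ((0, 1, 1), 0.1),
]

def predict(x, params):
    *a, b = params  # params = [a1, a2, a3, b]
    z = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
    return 1 / (1 + math.exp(-z))

def rss(params):
    """Sum of squared errors (assumed loss function)."""
    return sum((predict(x, params) - y) ** 2 for x, y in data)

def gradient(params):
    grad = [0.0] * len(params)
    for x, y in data:
        p = predict(x, params)
        common = 2 * (p - y) * p * (1 - p)  # derivative of (p - y)^2 w.r.t. the exponent
        for i, x_i in enumerate(x):
            grad[i] += common * x_i
        grad[-1] += common  # the bias term b has an implicit input of 1
    return grad

def fit(initial, learning_rate=0.05, steps=5000):
    params = list(initial)
    for _ in range(steps):
        grad = gradient(params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

# Run gradient descent from several random initial guesses and keep the best fit.
random.seed(0)
starts = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
best = min((fit(start) for start in starts), key=rss)
print([round(p, 2) for p in best], round(rss(best), 3))
```

Because these settings are assumed, the resulting coefficients need not match the fitted model above exactly.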

The Need for Interaction Terms

This model makes the following predictions. Some of them seem accurate, but others do not.

$\begin{align*} \begin{matrix} \begin{matrix} \textrm{scoops} \\ \textrm{peanut butter} \end{matrix} & \begin{matrix} \textrm{scoops} \\ \textrm{jelly} \end{matrix} & \begin{matrix} \textrm{slices} \\ \textrm{beef} \end{matrix} & \begin{matrix} \textrm{proportion} \\ \textrm{subjects liked} \end{matrix} & \textrm{prediction} \\ \hline 0 & 0 & 0 & 0.0 & 0.15 & \checkmark \\ 1 & 0 & 0 & 0.2 & 0.28 & \checkmark \\ 2 & 0 & 0 & 0.5 & 0.47 & \checkmark \\ 0 & 1 & 0 & 0.4 & 0.36 & \checkmark \\ 0 & 2 & 0 & 0.6 & 0.63 & \checkmark \\ 0 & 0 & 1 & 0.5 & 0.27 & \times \\ 0 & 0 & 2 & 0.8 & 0.44 & \times \\ 1 & 1 & 0 & 1.0 & 0.55 & \times \\ \mathbf 1 & \mathbf 0 & \mathbf 1 & \mathbf{0.0} & \mathbf{0.46} & \times \\ 0 & 1 & 1 & 0.1 & 0.54 & \times \\ \end{matrix} \end{align*}$
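The predictions in this table can be recomputed from the fitted model. The sketch below plugs in the coefficients as printed, which are rounded to two decimal places, so a recomputed prediction may differ from the table in the last digit. The names `predict` and `rows` are illustrative.

```python
import math

def predict(x1, x2, x3):
    # Fitted model without interaction terms (coefficients rounded to 2 decimals).
    z = 0.79 * x1 + 1.13 * x2 + 0.75 * x3 - 1.72
    return 1 / (1 + math.exp(-z))

# ((scoops peanut butter, scoops jelly, slices beef), proportion subjects liked)
rows = [
    ((0, 0, 0), 0.0), ((1, 0, 0), 0.2), ((2, 0, 0), 0.5),
    ((0, 1, 0), 0.4), ((0, 2, 0), 0.6), ((0, 0, 1), 0.5),
    ((0, 0, 2), 0.8), ((1, 1, 0), 1.0), ((1, 0, 1), 0.0),
    ((0, 1, 1), 0.1),
]

for x, observed in rows:
    print(x, observed, round(predict(*x), 2))
```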

The weirdest inaccurate prediction (bolded above) is that the model overrates peanut butter & roast beef sandwiches. It predicts that nearly half of the subjects ($0.46$) will like them, when in reality, none of the subjects did. And if you try to imagine that combination of ingredients, it probably doesn’t seem appetizing.

The problem is that our model is not sophisticated enough to capture the idea that two ingredients can taste good alone but bad together (or vice versa). It’s easy to see why this is:

The logistic function $\dfrac{1}{1 + e^{-(ax+b)}}$ is increasing if $a > 0$ and decreasing if $a < 0.$
The coefficient on $x_1$ (peanut butter) is $a_1 = 0.79$ and the coefficient on $x_3$ (roast beef) is $a_3 = 0.75.$
Both of these coefficients are positive. Consequently, the higher $x_1$ (the more scoops of peanut butter), the higher the prediction will be. Likewise, the higher $x_3$ (the more slices of roast beef), the higher the prediction will be.
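The reasoning above can be checked numerically with the fitted model that has no interaction terms. In this sketch (variable names are illustrative), the combined sandwich is necessarily rated higher than either ingredient alone, because each positive coefficient can only push the exponent up.

```python
import math

def predict(x1, x2, x3):
    # Fitted model without interaction terms (coefficients rounded to 2 decimals).
    z = 0.79 * x1 + 1.13 * x2 + 0.75 * x3 - 1.72
    return 1 / (1 + math.exp(-z))

pb_only = predict(1, 0, 0)    # peanut butter alone
beef_only = predict(0, 0, 1)  # roast beef alone
both = predict(1, 0, 1)       # peanut butter & roast beef together

# The combination is rated above either ingredient alone.
print(both > pb_only and both > beef_only)  # True
```

No choice of positive $a_1$ and $a_3$ can change this outcome, which is why the model cannot represent two ingredients that taste good alone but bad together.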

Interaction Terms

To fix this, we can add interaction terms that multiply two variables together. These terms will vanish unless both variables are nonzero.

$\begin{align*} y &= \dfrac{1}{1 + e^{-(a_1 x_1 + a_2 x_2 + a_3 x_3 + a_{12} x_1 x_2 + a_{13} x_1 x_3 + a_{23} x_2 x_3 +b)}} \end{align*}$

The interaction terms above are $a_{12} x_1 x_2,$ $a_{13} x_1 x_3,$ and $a_{23} x_2 x_3.$ The subscripts indicate which variables are being multiplied together.

Notice that, for example, the interaction term $a_{13}x_1 x_3$ will not have an effect on the predictions for $x_1$ (peanut butter) or $x_3$ (roast beef) in isolation, but it will have an effect when these ingredients are combined.
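One way to think about interaction terms is as extra input variables computed from the original ones. The helper below is a hypothetical sketch (the name `with_interactions` is an assumption) that appends every pairwise product in the order $x_1 x_2,$ $x_1 x_3,$ $x_2 x_3.$

```python
from itertools import combinations

def with_interactions(x):
    """Append every pairwise product x_i * x_j (i < j) to the input list."""
    return list(x) + [x_i * x_j for x_i, x_j in combinations(x, 2)]

# Peanut butter alone: every interaction term vanishes.
print(with_interactions([1, 0, 0]))  # [1, 0, 0, 0, 0, 0]

# Peanut butter & roast beef: only the x1*x3 term is nonzero.
print(with_interactions([1, 0, 1]))  # [1, 0, 1, 0, 1, 0]
```

With the inputs expanded this way, the model with interaction terms is just the earlier logistic model applied to $6$ input variables instead of $3.$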

If we fit this model again using gradient descent, we get the following result:

$\begin{align*} y &= \dfrac{1}{1 + e^{-(1.02 x_1 + 1.34 x_2 + 1.91 x_3 + 3.82 x_1 x_2 - 4.82 x_1 x_3 - 3.34 x_2 x_3 - 2.11)}} \end{align*}$

Now, the model makes much more accurate predictions.

$\begin{align*} \begin{matrix} \begin{matrix} \textrm{scoops} \\ \textrm{peanut butter} \end{matrix} & \begin{matrix} \textrm{scoops} \\ \textrm{jelly} \end{matrix} & \begin{matrix} \textrm{slices} \\ \textrm{beef} \end{matrix} & \begin{matrix} \textrm{proportion} \\ \textrm{subjects liked} \end{matrix} & \textrm{prediction} \\ \hline 0 & 0 & 0 & 0.0 & 0.11 & \checkmark \\ 1 & 0 & 0 & 0.2 & 0.25 & \checkmark \\ 2 & 0 & 0 & 0.5 & 0.48 & \checkmark \\ 0 & 1 & 0 & 0.4 & 0.32 & \checkmark \\ 0 & 2 & 0 & 0.6 & 0.64 & \checkmark \\ 0 & 0 & 1 & 0.5 & 0.45 & \checkmark \\ 0 & 0 & 2 & 0.8 & 0.85 & \checkmark \\ 1 & 1 & 0 & 1.0 & 0.98 & \checkmark \\ 1 & 0 & 1 & 0.0 & 0.02 & \checkmark \\ 0 & 1 & 1 & 0.1 & 0.10 & \checkmark \\ \end{matrix} \end{align*}$
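As before, these predictions can be recomputed from the fitted model. The sketch below uses the coefficients as printed (rounded to two decimal places), so small differences from the table in the last digit are possible. The names `predict` and `rows` are illustrative.

```python
import math

def predict(x1, x2, x3):
    # Fitted model with interaction terms (coefficients rounded to 2 decimals).
    z = (1.02 * x1 + 1.34 * x2 + 1.91 * x3
         + 3.82 * x1 * x2 - 4.82 * x1 * x3 - 3.34 * x2 * x3
         - 2.11)
    return 1 / (1 + math.exp(-z))

# ((scoops peanut butter, scoops jelly, slices beef), proportion subjects liked)
rows = [
    ((0, 0, 0), 0.0), ((1, 0, 0), 0.2), ((2, 0, 0), 0.5),
    ((0, 1, 0), 0.4), ((0, 2, 0), 0.6), ((0, 0, 1), 0.5),
    ((0, 0, 2), 0.8), ((1, 1, 0), 1.0), ((1, 0, 1), 0.0),
    ((0, 1, 1), 0.1),
]

for x, observed in rows:
    print(x, observed, round(predict(*x), 2))
```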

As a sanity check, we can also interpret the coefficients of the interaction terms:

The interaction term between $x_1$ (peanut butter) and $x_2$ (jelly) is $3.82 \, x_1 x_2.$ The positive coefficient indicates that combining peanut butter and jelly should increase the prediction.
The interaction term between $x_1$ (peanut butter) and $x_3$ (roast beef) is $-4.82 \, x_1 x_3.$ The negative coefficient indicates that combining peanut butter and roast beef should decrease the prediction.
The interaction term between $x_2$ (jelly) and $x_3$ (roast beef) is $-3.34 \, x_2 x_3.$ The negative coefficient indicates that combining jelly and roast beef should decrease the prediction.

Intuitively, this all makes sense. Peanut butter & jelly go together, but peanut butter & roast beef do not, and neither do jelly & roast beef.

Exercise

Implement the example that was worked out above.