Example 1 in The Regression Line presented a regression equation for predicting income from education. This is a good way to describe the cross-sectional relationship between income and education. But it may not be legitimate to interpret the slope of $1,200 per year in a cause-and-effect way: namely, that if you send people on to school for another year, that will make them earn another $1,200, on average. The reason is that this is not a controlled experiment. The effects of other variables may be confounded with the effects of education. The people in this sample who went to college may have been a very different group than the one that didn't go to college. Maybe it's the really bright kids that go to college and it's the dumb ones that don't go (to put it overly simply). Then it's possible that those college kids would have made more money than the others even if they hadn't gone to college: they are just smart and would do well at anything they did.
Many investigators would use multiple regression to control for the other variables. For instance, they might develop some measure for the intelligence of each person, and fit a multiple regression equation of the form
Y = a + bE + cI
where Y = predicted income, E = educational level,
I = measure of intelligence
The coefficient b would be interpreted as showing the effect of education, controlling for the effect of intelligence. Similarly, the coefficient c would be interpreted as showing the effect of intelligence on income, controlling for education.
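As a sketch of what such a fit looks like in practice, here is an ordinary least squares fit of Y = a + bE + cI. All the numbers are made up for illustration (the handout reports no data); the data are generated so that education and intelligence are confounded, as in the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical, made-up data: years of education, an intelligence
# score correlated with education, and income built from both.
education = rng.uniform(8, 20, n)
intelligence = 0.5 * education + rng.normal(0, 2, n)
income = (20_000 + 1_200 * education + 800 * intelligence
          + rng.normal(0, 3_000, n))

# Fit Y = a + b*E + c*I by least squares.
X = np.column_stack([np.ones(n), education, intelligence])
a, b, c = np.linalg.lstsq(X, income, rcond=None)[0]
print(f"a = {a:.0f}, b = {b:.0f}, c = {c:.0f}")
```

With data generated this way, b comes out near 1,200 and c near 800; whether such coefficients deserve a causal reading is exactly the question the text raises.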
This can give sensible results. But it can produce nonsense as well. Take the hypothetical investigator who was working on the area of rectangles in The Least Squares Approach handout. She could decide to control for the shape of the rectangles by multiple regression, using the length of the diagonal to measure shape. (Of course, this isn't a very good measure of shape; but then nobody knows how to measure intelligence very well either.) The investigator would fit a multiple regression equation of the form
predicted area = a + b·perimeter + c·diagonal
She might tell herself that b measures the effect of perimeter, controlling for the effect of shape. As a result, she would be even more confused than before. The perimeter and diagonal do determine the area, but only by a complicated nonlinear formula. Multiple regression is a powerful technique, but it is not a substitute for understanding what's going on! In this case, area equals length times width. So those two variables should be in the equation, and they should be multiplied by each other, like this:
area = length × width
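In fact, for rectangles the nonlinear formula can be written down exactly: if p = 2·(length + width) and d = √(length² + width²), then (length + width)² = length² + 2·length·width + width² gives area = p²/8 − d²/2. That is a quadratic, not linear, function of perimeter and diagonal, which is why the linear regression above misleads. A quick numeric check, on made-up rectangles:

```python
import math

# Made-up rectangles: (length, width) pairs.
rectangles = [(3.0, 4.0), (1.0, 10.0), (5.0, 5.0), (2.5, 7.0)]

for length, width in rectangles:
    perimeter = 2 * (length + width)
    diagonal = math.hypot(length, width)
    # Exact nonlinear relation: area = p^2/8 - d^2/2.
    reconstructed = perimeter ** 2 / 8 - diagonal ** 2 / 2
    assert abs(reconstructed - length * width) < 1e-9
    print(length, width, reconstructed)
```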
Note that in multiple regression, the various independent variables are added together, not multiplied or divided. So, to test the multiplicative model above, you would have to do one of two things. (a) You can multiply the two variables together first, then put the product into the regression. For example:
LW = length × width
area = a + b·LW
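Option (a) can be sketched as follows, again with made-up rectangles. Since area is exactly length × width, the fitted slope on the product LW should come out 1 and the intercept 0, up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(1)
length = rng.uniform(1, 10, 50)
width = rng.uniform(1, 10, 50)
area = length * width

# Multiply the two variables together first...
LW = length * width
# ...then put the product into an ordinary linear regression.
X = np.column_stack([np.ones_like(LW), LW])
intercept, slope = np.linalg.lstsq(X, area, rcond=None)[0]
print(f"intercept = {intercept:.6f}, slope = {slope:.6f}")
```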
Or (b), you can take the log of both sides of the equation:
log(area) = log(length) + log(width)
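Option (b) turns the multiplicative model into an additive one that regression can handle: fitting log(area) = a + b·log(length) + c·log(width) should recover a ≈ 0 and b ≈ c ≈ 1. A sketch with made-up rectangles:

```python
import numpy as np

rng = np.random.default_rng(2)
length = rng.uniform(1, 10, 50)
width = rng.uniform(1, 10, 50)
area = length * width

# Regress log(area) on log(length) and log(width).
X = np.column_stack([np.ones(50), np.log(length), np.log(width)])
a, b, c = np.linalg.lstsq(X, np.log(area), rcond=None)[0]
print(f"a = {a:.6f}, b = {b:.6f}, c = {c:.6f}")
```

Because log(area) is exactly log(length) + log(width) here, the fit is perfect; with real, noisy data the coefficients would only come out approximately 0, 1, and 1.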
Isn't that special?
Drawn from Statistics, by Freedman, Pisani, Purves and Adhikari