The Crusade Against Multiple Regression Analysis

  • One of the must-read papers in this area is a very short, very readable piece by Chris Achen called "Let's Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong." [0]

    The basic idea is that if your data picks up non-linear coding errors from some external process before it reaches you, then relative to the true causal relationship between your variables, you are actually fitting a (possibly) non-linear transformation of them.

    If you do this and still use the classical cook-book t-stat / p-value stuff, you can run into big problems. As the paper shows, even for an extremely simple data set where there is merely a non-linear coding error, the coefficient can be very statistically significant (t-stat magnitude > 2) and yet it will have the wrong sign.

    This is a really remarkable thing. By just distorting the coding of your data a bit, you can obtain a result that passes the usual, simplistic significance tests but for which the estimated effect is negative when it should be positive (or vice versa).

    I remember the sense of dread when I worked in quant finance and we were using a terrible, shamefully bad automated framework for frequentist model fitting. If the data we used had been jittered by the data vendor, or if there was a slight bug in the code that did variable cleaning, outlier manipulation, scoring, or any of a hundred other steps where tiny non-linearities could be introduced, then it would be entirely possible for us to see "significant" results that pointed in the opposite direction from the truth.

    There are many other papers which point out the pitfalls of naive dump-it-all-on-the-right-hand-side regression models, but I think this one from Achen is unique in that it is extremely short, extremely simple to follow, and yet it is fully devastating to this entire technique of modeling.

    [0] http://www.columbia.edu/~gjw10/achen04.pdf
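
The sign flip described in the comment above can be sketched on synthetic data. This is not the data set from Achen's paper; it is a made-up illustration under stated assumptions: the true relationship is y = x1 + 0.1 * x2 with no noise, x2 depends on x1 through a cubic term, and the analyst only sees x1 after a monotone but non-linear recoding (here x1 squared, so the levels 0, 1, 2 arrive as 0, 1, 4). The helper `ols` and all variable names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Assumed synthetic design: x1 has three levels, each paired with a
# small balanced offset e, repeated 20 times (120 rows in total).
reps = 20
x1 = np.tile(np.repeat([0.0, 1.0, 2.0], 2), reps)
e = np.tile([-0.5, 0.5], 3 * reps)
x2 = x1**3 + e              # x2 is correlated with x1 (assumption)
y = x1 + 0.1 * x2           # true effect of x2 is +0.1, no noise

x1_coded = x1**2            # monotone, non-linear "coding error"
ones = np.ones_like(y)

# Regression with the correctly coded x1 recovers the true +0.1.
beta_true, _ = ols(np.column_stack([ones, x1, x2]), y)

# Regression with the miscoded x1: same y, same x2.
beta_coded, se_coded = ols(np.column_stack([ones, x1_coded, x2]), y)
t_coded = beta_coded / se_coded

print("x2 coefficient, x1 coded correctly:", beta_true[2])
print("x2 coefficient, x1 miscoded:       ", beta_coded[2])
print("t-stat on x2, x1 miscoded:         ", t_coded[2])
```

In this sketch the coefficient on x2 comes out at roughly -0.125 with a t-stat well beyond 2 in magnitude, even though the true effect is +0.1 and the recoding preserves the ordering of x1. The leftover curvature from the miscoded x1 is absorbed by x2 because x2 is correlated with it.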

  • One interesting point about regression analysis, though, is that its outcome is perfectly suited to being exploited by marketing & advertising. Explaining all the caveats of an experiment is hard, but saying that "people who eat X are less likely to get cancer" is very powerful. So, in a way, I suspect these types of studies (regression) are "pulled" by marketing rather than pushed by research (at least for corporate-sponsored studies).

  • Does factor analysis/principal component analysis help any? The article begins by talking about how looking at a single variable (vitamin E) that is strongly correlated with a bunch of other variables (healthy lifestyle) points one in the wrong direction. If you attempt to capture "all" (I know - not a priori possible) the relevant variables, are you going to do better?