Another temptation when faced with the availability of massive amounts of disparate data
is to throw in a large number of controls in regression analysis to find ‘true’ relationships
in the data. Doing so can strengthen the analysis’ internal validity by ‘netting out’ the
effects of other endogenous factors on the dependent variable, but it brings about several
potential challenges. First, it downplays the fact that combining data from multiple
sources may also mean “magnifying”⁶⁶ their flaws. Second, if the results are contingent on factors
specific to the unit of analysis, they also become hard to generalise to other settings with
vastly different mean values of the included controls: external validity is then weakened.
There is another econometric downside to throwing in a large number of controls
indiscriminately. If many of them are correlated, the resulting multicollinearity inflates
the variance of the coefficient estimates and, in the limit of perfect collinearity, leaves
them not uniquely determined. Parameter values then become more or less arbitrary, meaning that
the results may be misleading. This suggests that theory and context matter even (or,