Mar 20, 2012

Finite Precision and Statistical Models

If you aren't already familiar with some of the common errors in statistical modeling, I highly recommend Peter Kennedy's A Guide to Econometrics. This post is about a modeling issue that I haven't read about in any book yet, but have seen a couple of times in person.

The issue is caused by finite precision. Maybe the reason I haven't read about it in a book is that precision issues are practical rather than academic concerns. Most analysis of statistical model fitting is done in the real number system, which leaves precision out of consideration entirely. Still, finite precision becomes a problem for models around their steep regions. Consider this S-shaped curve that gives the relationship between two variables:

It is steep in some parts and very shallow in others. Any uncertainty in the value of one variable implies an uncertainty in the value of the other. That uncertainty grows where the magnitude of the curve's derivative is greater than 1 and shrinks where it is less than 1. In other words, the same amount of uncertainty in the value of the independent variable (the x-value) produces different amounts of uncertainty in the value of the dependent variable (the y-value). We are more uncertain about the value of the dependent variable where the curve is steep (derivative greater than 1) and more certain about it where the curve is shallow. You can see this in the graph below:

Though the red and green lines span the same interval on the independent variable's axis (the x-axis), the red one spans a much larger interval on the dependent variable's axis (the y-axis). Any uncertainty in x within the red region is magnified considerably in y. That is exactly what makes this S-shaped curve problematic.
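
To first order, this magnification is just the derivative acting on the uncertainty: Δy ≈ |f'(x)|·Δx. Here is a small sketch of that propagation (in Python rather than the post's Mathematica, and using the 10 Tanh[x] curve from the code at the end):

```python
import math

def f(x):
    return 10 * math.tanh(x)  # the S curve used later in the post

def fprime(x):
    return 10 / math.cosh(x) ** 2  # its derivative, 10*sech(x)^2

dx = 0.2  # a fixed uncertainty in the independent variable

# first-order error propagation: dy ~ |f'(x)| * dx
dy_steep = fprime(0.0) * dx  # middle of the S, where the derivative is 10
dy_flat = fprime(4.0) * dx   # tail of the S, where the derivative is tiny

print(dy_steep)  # 2.0
print(dy_flat)   # roughly 0.003
```

The same ±0.2 in x turns into ±2 in y at the middle of the curve, but only a few thousandths in the tails.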

Consider a set of data drawn from this curve plus some constant random noise on the dependent values. That, by itself, defines a nicely homoskedastic data set. But if there is any uncertainty in the independent values as we measured them, then the noise on the dependent values effectively becomes larger where the model is steep, even though the uncertainty in x is constant. Here is what 500 such points look like with Gaussian noise on the dependent values and some smaller Gaussian noise on the independent values.

The curve is much thicker in the middle: the data set is heteroskedastic. If, instead of adding Gaussian noise to the independent values, we simply rounded them, we would get the same problem.
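
You can check this heteroskedasticity numerically. The sketch below (Python for convenience; the post's own code is Mathematica) simulates points from 10 Tanh[x] with constant Gaussian noise on both variables, then compares the spread of the residuals in the steep middle against the flat tails. The bin boundaries 0.5 and 3 are arbitrary choices for illustration:

```python
import math
import random
import statistics

random.seed(0)

def f(x):
    return 10 * math.tanh(x)  # the S curve

mid, tail = [], []
for _ in range(20000):
    x_true = random.uniform(-5, 5)
    x_obs = x_true + random.gauss(0, 0.2)  # constant noise on the measured x
    y = f(x_true) + random.gauss(0, 0.2)   # constant noise on the measured y
    r = y - f(x_obs)                       # residual against the curve at the measured x
    if abs(x_obs) < 0.5:
        mid.append(r)    # steep middle of the S
    elif abs(x_obs) > 3:
        tail.append(r)   # flat tails of the S

print(statistics.stdev(mid))   # on the order of 2
print(statistics.stdev(tail))  # close to the 0.2 we put in
```

Even though both noise terms are constant, the residuals in the middle are several times more spread out than in the tails.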

Thicker in the middle again. Compare this to the case with no noise or rounding in the independent data.

This one is homoskedastic. It's the kind of example you would see in a textbook, and it is completely unrealistic compared to the first two data sets.

If we try to fit anything to the first two sets of data, the points in the middle will have a stronger influence on the result than the rest, because of the increased variance in the middle of the S shape. That might not look so bad for this data set, but what if we replace the S curve with 1/x or Log(x)? With those, values close to 0 will carry enormous weight in determining the result of a least-squares regression.
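
The same first-order propagation shows why. For y = 1/x the derivative is -1/x², so a fixed uncertainty Δx in the independent value inflates to roughly Δx/x² in the dependent value: halving the distance to zero quadruples the effective noise, and those inflated residuals get squared by least squares. A tiny Python illustration:

```python
dx = 0.01  # a fixed uncertainty in the independent value
xs = (1.0, 0.1, 0.01)

# for y = 1/x, |dy/dx| = 1/x^2, so the propagated uncertainty is dy ~ dx / x^2
dys = [dx / x ** 2 for x in xs]

for x, dy in zip(xs, dys):
    print(x, dy)  # the closer x is to 0, the larger the effective noise in y
```

At x = 0.01 the effective uncertainty in y is ten thousand times what it is at x = 1, all from the same measurement error in x.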

How many people are really even thinking about these kinds of things when they use statistical models?

Here's the Mathematica code used to generate the three data sets from the plots above, in order. The S curve is 10 Tanh[x].

xsamples = RandomReal[{-5, 5}, 500];
(* first set: Gaussian noise on both the independent and dependent values *)
firstSet =
 Transpose@{xsamples + RandomVariate[NormalDistribution[0, 0.2], 500],
   10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};
(* second set: independent values rounded to the nearest 0.5 *)
secondSet =
 Transpose@{Round[xsamples, 0.5],
   10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};
(* third set: exact independent values, noise on the dependent values only *)
thirdSet =
 Transpose@{xsamples,
   10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};