If you aren't already familiar with some of the common errors in statistical modeling, I highly recommend Peter Kennedy's A Guide to Econometrics. This post is about a modeling issue that I haven't read about in any book yet, but have seen a couple of times in person.
The issue is caused by finite precision. Maybe the reason I haven't read about it in a book is that precision issues are practical rather than academic concerns. Most of the analysis of statistical model fitting is done over the real numbers, which leaves precision out of consideration entirely. Still, finite precision becomes a problem for models around their steep points. Consider this S-shaped curve that gives the relationship between two variables:
Though the red and green lines span the same region on the independent variable's axis (x-axis), red spans a larger section of the dependent variable's axis (y-axis). Any uncertainty around the red region is magnified considerably. That is what makes this S-shaped curve problematic.
Consider a set of data taken from this curve plus some constant random noise. So far that defines a homoskedastic data set, which is the ideal case. But if there is any uncertainty in the independent values as we measured them, the local slope of the curve scales that x-error into a y-error, so the noise in the dependent values effectively becomes larger where the model is steep, even if the uncertainty in the value of x is constant. Here is what the set of data looks like for 500 points with Gaussian noise on the dependent values and some smaller Gaussian noise on the independent values.
The curve is so much thicker in the middle; it's heteroskedastic. And if, instead of adding Gaussian noise to the independent values, we simply round them, we get the same problem.
Thicker in the middle. Compare this to no noise or rounding in the independent data.
This is homoskedastic. It's the example you would see in a textbook, and it's completely unrealistic compared to the first two data sets.
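The local slope says exactly how much the x-noise is magnified: to first order, x-noise with standard deviation σ shows up in y scaled by |f'(x)|. A quick check with the 10 Tanh[x] curve used in the code below:

(* the local slope scales x-noise into y-noise *)
f[x_] := 10 Tanh[x];
f'[x]  (* 10 Sech[x]^2, which peaks at f'[0] == 10 *)
(* so x-noise with sd 0.2 acts like y-noise with sd up to ~2 near
   x == 0, ten times the noise added to y directly, while out in
   the flat tails it contributes almost nothing *)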
If we try to fit anything to the first two data sets, the points in the middle will have a stronger influence on the result than the rest of the data because of the increased variance in the middle of the S shape. That might not look so bad for this data set, but what if we replace our S curve with 1/x or Log(x)? With those, the slope grows without bound as x approaches 0, so values close to 0 will carry enormous weight in determining a least squares regression.
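To make that concrete, here's a sketch along the same lines as the code below (the range, rounding step, and noise levels are illustrative choices of mine, not from the plots above):

trueX = RandomReal[{0.1, 5}, 500];
ynoise = RandomVariate[NormalDistribution[0, 0.1], 500];
exactData = Transpose@{trueX, Log[trueX] + ynoise};
roundedData = Transpose@{Round[trueX, 0.05], Log[trueX] + ynoise};
(* both fits estimate a + b Log[x]; the true values are a == 0, b == 1 *)
LinearModelFit[exactData, Log[x], x]["BestFitParameters"]
LinearModelFit[roundedData, Log[x], x]["BestFitParameters"]
(* the second fit wanders more from run to run: near x == 0 the slope
   1/x blows up, so the rounded points there carry most of the error *)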
How many people are really even thinking about these kinds of things when they use statistical models?
Here's the Mathematica code used to generate the three data sets from the plots above, in order. The S curve is 10 Tanh[x].
(* 500 true x values, uniform on [-5, 5] *)
xsamples = RandomReal[{-5, 5}, 500];

(* Gaussian noise on both axes: y comes from the true x,
   but the recorded x is the true x plus measurement noise *)
firstSet = Transpose@{
    xsamples + RandomVariate[NormalDistribution[0, 0.2], 500],
    10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};

(* same y noise, but the recorded x is rounded to the nearest 0.5 *)
secondSet = Transpose@{
    Round[xsamples, 0.5],
    10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};

(* textbook case: x recorded exactly, noise only on y *)
thirdSet = Transpose@{
    xsamples,
    10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};
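To view the three data sets, a minimal plotting call (not necessarily what produced the figures above) is:

ListPlot /@ {firstSet, secondSet, thirdSet}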