Mar 20, 2012

Finite Precision and Statistical Models


If you aren't already familiar with some of the common errors in statistical modeling, I highly recommend Peter Kennedy's A Guide to Econometrics. This post is about a modeling issue that I haven't read about in any book yet, but have seen a couple of times in person.

The issue is caused by precision. Maybe the reason I haven't read about this in a book is that precision issues are applied rather than academic concerns. Most analysis of statistical model fitting is done in the real number system, which leaves precision out of consideration. Still, finite precision becomes a problem for models around their steep points. Consider this S-shaped curve that gives the relationship between two variables:



It is steep in some parts and very shallow in others. Any uncertainty in the value of one variable implies an uncertainty in the value of the other. The uncertainty grows where the magnitude of the curve's derivative is greater than 1 and shrinks where it is less than 1. In other words, the same amount of uncertainty in the independent variable (x-value) produces different amounts of uncertainty in the dependent variable (y-value). We are more uncertain about the value of the dependent variable where the curve is steep and more certain about it where the curve is shallow. You can see this in the graph below:
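To first order, this is ordinary error propagation. As a sketch, write the curve as y = f(x) and let δx and δy stand for small uncertainties in each variable (f, δx and δy are notation introduced here for illustration):

```latex
\[
  \delta y \approx \left| f'(x) \right| \, \delta x
\]
```

Where |f'(x)| is above 1 the uncertainty is magnified, and where it is below 1 the uncertainty shrinks. The approximation only holds while δx is small compared to the scale over which the slope changes.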

Though the red and green lines span the same width on the independent variable's axis (x-axis), red spans a much larger section of the dependent variable's axis (y-axis). Any uncertainty around the red region is magnified considerably, and that is what makes this S-shaped curve problematic.

Consider a set of data that is taken from this curve plus some constant random noise. I've basically defined a homoskedastic data set, which is great. But if there is any uncertainty in the independent values as we measured them, then the noise in the dependent values effectively becomes larger where the model is steep, even if the uncertainty in x is constant. Here is what the data looks like for 500 points with Gaussian noise on the dependent values and some smaller Gaussian noise on the independent values.
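A rough way to quantify this, assuming the two noise sources are independent and the x-noise is small enough for a linear approximation, is to add the variances. The symbols σ_x, σ_y and σ_eff are introduced here for illustration. The numbers plugged in are the 10 Tanh[x] curve and the 0.2 standard deviations that appear in the code at the end of this post:

```latex
\[
  \sigma_{\text{eff}}^2(x) \approx \sigma_y^2 + f'(x)^2 \, \sigma_x^2
\]
\[
  f(x) = 10 \tanh x, \qquad f'(x) = 10 \, \operatorname{sech}^2 x
\]
\[
  \sigma_{\text{eff}}(0) \approx \sqrt{0.2^2 + 10^2 \cdot 0.2^2} \approx 2.0,
  \qquad
  \sigma_{\text{eff}}(3) \approx \sqrt{0.2^2 + 0.099^2 \cdot 0.2^2} \approx 0.20
\]
```

Under these assumptions the vertical scatter at the center of the S comes out about ten times larger than in the flat tails, even though neither noise source changes with x.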


The curve is so much thicker in the middle. It's heteroskedastic. If instead of Gaussian noise in the independent values we simply rounded the independent values, we would get the same problem.
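Rounding to the nearest multiple of a step h leaves an error somewhere between -h/2 and h/2. If that error is treated as uniformly distributed (an assumption, reasonable when the true values are spread smoothly across each rounding interval), its standard deviation works out to the following. The symbols h and σ_round are introduced here for illustration, and h = 0.5 is the step used in the code at the end of this post:

```latex
\[
  \sigma_{\text{round}} = \frac{h}{\sqrt{12}},
  \qquad
  h = 0.5 \;\Rightarrow\; \sigma_{\text{round}} \approx 0.14
\]
```

So rounding behaves much like x-noise of that size, and it gets magnified by the slope in the same way.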


Thicker in the middle. Compare this to no noise or rounding in the independent data.


This is homoskedastic. It's an example that you would see in a textbook and is completely unrealistic compared to the first two data sets.

If we try to fit anything to the first two sets of data, the elements in the middle will have a stronger influence on the result than the rest of the data because of the increased variance in the middle of the S shape. That might not look so bad for this data set, but what about replacing our S curve with 1/x or Log(x)? With these, values close to 0 will have an incredible weight in determining the value of a least squares regression.
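The reason 1/x and Log(x) are worse is visible in their slopes. The factor by which x-uncertainty is magnified is roughly the magnitude of the derivative. A sketch, taking Log as the natural logarithm:

```latex
\[
  f(x) = \frac{1}{x}: \quad \left| f'(x) \right| = \frac{1}{x^2},
  \qquad\qquad
  f(x) = \log x: \quad \left| f'(x) \right| = \frac{1}{x} \quad (x > 0)
\]
```

Both grow without bound as x approaches 0, whereas the slope of an S curve is bounded. At x = 0.01, for example, the magnification is about 10^4 for 1/x and 10^2 for Log(x).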

How many people are really even thinking about these kinds of things when they use statistical models?

Here's some Mathematica code used to generate the three sets from the plots above in order. The S curve is 10*Tanh(x).

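(* 500 x-values drawn uniformly from [-5, 5] *)
xsamples = RandomReal[{-5, 5}, 500];

(* Gaussian noise on both the x- and y-values *)
firstSet =
  Transpose@{xsamples + RandomVariate[NormalDistribution[0, 0.2], 500],
    10 Tanh[xsamples] +
     RandomVariate[NormalDistribution[0, 0.2], 500]};

(* x-values rounded to the nearest 0.5, Gaussian noise on y only *)
secondSet =
  Transpose@{Round[xsamples, 0.5],
    10 Tanh[xsamples] + RandomVariate[NormalDistribution[0, 0.2], 500]};

(* exact x-values, Gaussian noise on y only *)
thirdSet =
  Transpose@{xsamples,
    10 Tanh[xsamples] +
     RandomVariate[NormalDistribution[0, 0.2], 500]};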
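
One rough way to check the "thicker in the middle" claim numerically is to compare how far the points sit from the true curve near the center of the S and out in the tails. The sketch below regenerates the three sets so that it runs on its own. The helper names (f, noise, residuals, spread, middle, tails), the seed, and the cut-offs 0.5 and 3 are illustrative choices, not part of the code above.

```mathematica
SeedRandom[1];  (* arbitrary seed, only so that a run is repeatable *)
n = 500;
f[x_] := 10 Tanh[x];  (* the S curve used in the listing above *)
noise[] := RandomVariate[NormalDistribution[0, 0.2], n];  (* fresh draw per call *)

xsamples = RandomReal[{-5, 5}, n];
firstSet = Transpose@{xsamples + noise[], f[xsamples] + noise[]};
secondSet = Transpose@{Round[xsamples, 0.5], f[xsamples] + noise[]};
thirdSet = Transpose@{xsamples, f[xsamples] + noise[]};

(* vertical distance of each point from the true curve at its recorded x *)
residuals[set_] := set[[All, 2]] - f[set[[All, 1]]];

(* standard deviation of the residuals whose recorded x passes the test *)
spread[set_, test_] :=
  StandardDeviation@Pick[residuals[set], test /@ set[[All, 1]]];

middle = (Abs[#] <= 0.5 &);  (* steep part of the curve *)
tails = (Abs[#] >= 3 &);     (* flat parts of the curve *)

TableForm[
 {spread[#, middle], spread[#, tails]} & /@ {firstSet, secondSet, thirdSet},
 TableHeadings -> {{"noisy x", "rounded x", "exact x"}, {"middle", "tails"}}]
```

If the argument in this post holds, the "middle" column should be several times larger than the "tails" column for the first two sets and about the same for the third.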

Mar 17, 2012

The Creative Class

I keep seeing articles and blogs about the importance of being a content creator. This is actually one of Jonathan Zittrain's big principles for the internet: we need tech that allows us to create content, not tech that only allows us to consume it.

But more recently, curation as an alternative to content creation has become popular. Pinterest and Tumblr are both examples of this category. Hell, so are search engines. Content curation is possible with many more kinds of devices than creation. It can be done using mobile devices and takes advantage of passive interaction (read Wu-Wei) such as page views.

The shift to curation is in part a response to a saturation of information on the internet. Search engines are data curation. I am hardly qualified to create new content that is worth much except in a few small areas. This blog for example is fairly worthless.

A bit more on the dark side, maybe we should flip how we think about curation from collecting good content to destroying bad content. We should be talking about content destruction. People are reluctant to see that more content is often destructive. Curation is only useful because it filters out worthless data.

Worthless data we keep on ourselves can also come back to hurt us later through data mining.