I’ll go into more technical detail about robust data analysis elsewhere. Here I want to put forward the simplest argument for it. (This is repeated, probably near-verbatim, from Karen Kafadar.)

Say you have 5 independent estimates of something important, and you’re going to make a big decision based on what the true story is. The numbers are in thousands of dollars, because it’s important.

  • $77,010 k
  • $76,778 k
  • $798,344 k
  • $78,652 k
  • $78,136 k

Oops, there’s a typo, but you don’t notice it. Having read The Wisdom of Crowds and learned the Central Limit Theorem, you naturally average these estimates together to cancel out possible biases or inaccuracies. This is part of a much larger project with many more numbers (which is why you didn’t notice the typo), so you aren’t using common sense on the numbers, just plugging them into your analytic tools.

  • Result: $221,800 k.
    Due to your unnoticed typo, the analysis is majorly wrong, the decision that follows from it is majorly wrong, and everybody loses.

Let’s say you had used the median instead of the mean. Nobody tells you in Stat 101 that the median is much more robust, nor do they talk about trimming, letter-value plots, trimeans, the five-number summary, and so on.

  • Result: $78,140 k

Yes, regardless of the typo, the result is broadly correct. The correct data would have shown a mean ± SD of $78,080 k ± $624 k, so the median is within bounds.
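
If you want to check the arithmetic, here’s a quick sketch using Python’s standard statistics module. (I’m assuming the typo’d entry was meant to read $79,834 k, since that’s roughly what reproduces the correct mean quoted above.)

    from statistics import mean, median

    with_typo = [77_010, 76_778, 798_344, 78_652, 78_136]  # the five estimates above
    corrected = [77_010, 76_778, 79_834, 78_652, 78_136]   # extra digit removed (my assumption)

    print(round(mean(with_typo)))    # 221784 -> the wildly wrong average
    print(round(median(with_typo)))  # 78136  -> barely notices the typo
    print(round(mean(corrected)))    # 78082  -> close to the correct mean quoted above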

Of course, I just made this data up. But pick a distribution, generate 100 random numbers from it, inject an extra digit into one or more of them, and see the effect on the mean and on the median. You can analyze the differences with calculus, but I think the intuition is obvious enough that I can just leave it there.
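
To make that concrete, here’s one way you could run the experiment; the distribution, the seed, and the digit-injection rule are all arbitrary choices of mine.

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.normal(loc=78_000, scale=600, size=100)   # pick a distribution

    corrupted = x.copy()
    i = rng.integers(len(corrupted))
    # "inject an extra digit": tack a random digit onto the end of one entry,
    # which roughly multiplies it by ten
    corrupted[i] = corrupted[i] * 10 + rng.integers(10)

    print(np.mean(x), np.median(x))                   # both near 78,000
    print(np.mean(corrupted), np.median(corrupted))   # mean jumps by thousands; median barely moves

Run it a few times with different seeds: the mean gets dragged around by the one bad entry, while the median sits still.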

It’s unbelievable that I didn’t learn these methods until graduate school. Undergraduate journalism majors are taught betas, p-values, null hypothesis versus alternative hypothesis, and theoretical “samples”, “populations”, and “experiments”. But they don’t do, like, simple common-sense data analysis: just poking around, without heavy math tools, to ask natural questions.

After a year of econometrics (regressions, F-tests, matrix algebra to talk about heteroscedasticity, probit, logit, instrumentation, calculus of variations), a year of proofs about probability theory and stochastic calculus, and the regular sequence of prob/stat that lots of people take…

I didn’t understand what a residual was.

It’s obvious once you’re using robust statistics. You aren’t so juiced up on the fancy method, because the method is dead simple, so you know there will be unexplained error left over. You come to see the statistical method as more like

  1. correcting for known factors,
  2. plotting the data on a reasonable axis, and
  3. centering it.

Outliers are removed and examined in a separate analysis — maybe one that doesn’t involve statistical tricks.
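
Here’s a minimal sketch of that mindset in Python. The function name, the MAD-based cutoff for what gets set aside, and the fake data are all my own illustrative choices, not a standard recipe.

    import numpy as np

    def robust_residuals(y, known_effect):
        # 1. correct for known factors
        adjusted = y - known_effect
        # (step 2, "a reasonable axis", might mean taking logs first; skipped here)
        # 3. center what's left, robustly
        centered = adjusted - np.median(adjusted)
        # set aside points that are wildly far out, using the MAD as a robust spread
        mad = np.median(np.abs(centered))
        outliers = np.abs(centered) > 3 * 1.4826 * mad   # 1.4826*MAD ~ SD for normal data
        return centered, outliers    # residuals, plus the points to examine separately

    rng = np.random.default_rng(0)
    trend = 2.0 * np.arange(100)
    y = 50 + trend + rng.normal(0, 5, size=100)
    y[17] *= 10                      # one stray digit
    resid, flagged = robust_residuals(y, known_effect=trend)
    print(np.flatnonzero(flagged))   # should point at entry 17

Whatever gets flagged goes into its own, separate look; the residuals you keep are just “what’s left after the simple steps,” which is the whole point.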

That’s analysis I can believe in.
