Posts tagged with statistics

Statistics vs Machine-Learning
by Rob Tibshirani


Big, long cycle = trend.

Three observations get you there:

  1. min {a,b,c} = − max {−a, −b, −c}
  2. second-from-top {a,b,c,d,e} = max ( {a,b,c,d,e} without max{a,b,c,d,e} )
  3. max {a,b,c} ~ log_t (t^a + t^b + t^c ),   t→∞

Putting these three together you can make a continuous formula approximating the median. Just subtract off the ends until you get to the middle.

It’s ugly. But, now you have a way to view the sort operation—which is discontinuous—in a “smooth” way, even if the smudging/blurring is totally fabricated. You can take derivatives, if that’s something you want to do. I see it as being like q-series: wriggling out from the strictures so the fixed becomes fluid.
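As a minimal sketch of the idea, assuming the smallest case of three values: write t = e^beta so that observation 3 becomes a log-sum-exp, get the min from observation 1, and subtract both ends from the total to leave the middle. The parameter name `beta` and the function names are illustrative assumptions, not the post's notation, and the five-value case needs more machinery than is shown here.

```python
import math

def smooth_max(xs, beta=50.0):
    """Observation 3: max(xs) ~ log_t(sum(t**x)) with t = e**beta.

    The result always lies between max(xs) and max(xs) + log(len(xs)) / beta.
    """
    m = max(xs)  # shift for numerical stability only; it cancels algebraically
    return m + math.log(sum(math.exp(beta * (x - m)) for x in xs)) / beta

def smooth_min(xs, beta=50.0):
    """Observation 1: min is minus the max of the negated values."""
    return -smooth_max([-x for x in xs], beta)

def smooth_median3(a, b, c, beta=50.0):
    """Median of three values: the total minus the two (smoothed) ends."""
    xs = [a, b, c]
    return a + b + c - smooth_max(xs, beta) - smooth_min(xs, beta)
```

A small `beta` blurs heavily and a large `beta` hugs the true median, so `beta` plays the role of t→∞ in observation 3.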

If the astronomical observations and other quantities, on which the computation of orbits is based, were absolutely correct, the elements also, whether deduced from three or four observations, would be strictly accurate (so far indeed as the motion is supposed to take place exactly according to the laws of Kepler), and, therefore, if other observations were used, they might be confirmed but not corrected.

But since all our measurements and observations are nothing more than approximations to the truth, the same must be true of all calculations resting upon them, and the highest aim of all computations made concerning concrete phenomena must be to approximate, as nearly as practicable, to the truth. But this can be accomplished in no other way than by a suitable combination of more observations than the number absolutely requisite for the determination of the unknown quantities. This problem can only be properly understood when an approximate knowledge of the orbit has been already attained, which is afterwards to be corrected so as to satisfy all the observations in the most accurate manner possible.

Johann Carl Friedrich Gauß, Theoria Motus Corporum Cœlestium in Sectionibus Conicis solem Ambientium, 1809

(translation by C.H. Davis 1963)



1. As Dan Davies observed (from memory): “The Great Depression really happened; it wasn’t just an unusually inaccurate observation of an underlying 4% return on equities”

2. Why do we assume errors have zero mean?  …the mean of the residuals is not identifiable separately from the intercept, and we just choose the parametrization that has mean-zero residuals. In that situation it’s not an assumption and couldn’t be falsified empirically.
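Point 2 can be checked on a toy regression. The sketch below uses made-up numbers and a hand-rolled least-squares fit (both are assumptions for illustration): giving every error an extra +5 changes the fitted intercept by 5 and leaves the slope and the mean residual untouched, so the data cannot tell "errors with mean 5" apart from "an intercept that is 5 larger".

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def mean_residual(xs, ys):
    """Average of y minus the fitted value, using the OLS fit above."""
    intercept, slope = fit_line(xs, ys)
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys)) / len(xs)

# Made-up illustrative data, not taken from the post.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Pretend every error carries an extra +5, i.e. errors with mean 5, not 0.
ys_shifted = [y + 5.0 for y in ys]

b0, b1 = fit_line(xs, ys)
c0, c1 = fit_line(xs, ys_shifted)
print("slope change:", c1 - b1)
print("intercept change:", c0 - b0)
print("mean residual after shift:", mean_residual(xs, ys_shifted))
```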

Statisticians are crystal clear on human variation. They know that not everyone is the same. When they speak about groups in general terms, they know that they are reducing N-dimensional reality to a 1-dimensional single parameter.

Nevertheless, statisticians permit, in their regression models, variables that take on only one of a few values, such as {0,1} for male/female or {a,b,c,d} for married/never-married/divorced/widowed.
No one doing this believes that all such people are the same. And anyone who’s done the least bit of data cleaning knows that there will be NA's, wrongly coded cases, mistaken observations, ill-defined measures, and aberrances of other kinds. It can still be convenient to use binary or n-ary dummies to speak simply. Maybe the marriages of some people coded as currently married are on the rocks, and therefore they are more like divorced, or like a new category of people in the midst of watching their lives fall apart. Yes, we know. But what are you going to do: ask respondents to rate their marriage on a scale of one to ten? That would introduce false precision and model error, and might put respondents in such a strange mood that they answer other questions strangely. Better to just live with being wrong. Any statistician who uses the cut function in R knows that the variable didn’t turn from continuous into basketed in reality. But a facet_wrap plot is easier to interpret than a 3D wireframe or point-cloud plot.

To the precise mind, there’s a world of difference between saying

  • "the mean height of men > the mean height of women", and saying
  • "men are taller than women".


Of course one can interpret the second statement to be just a vaguer, simpler inflection of the first. But some people understand  statements like the second to mean “each man is taller than each woman”. Or, perniciously, they take “Blacks have lower IQ than Whites” to mean “every Black is mentally inferior to every White.”
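The gap between the two height statements can be made concrete with a few numbers. The heights below are invented purely for illustration (they are not survey data): the mean for men is higher, yet the "each man is taller than each woman" reading fails, and only some man/woman pairings have the man taller.

```python
from itertools import product

# Made-up heights in cm, for illustration only.
men = [165, 172, 178, 181, 190]
women = [155, 160, 166, 170, 179]

def mean(xs):
    return sum(xs) / len(xs)

# Reading 1: a statement about two averages.
mean_gap = mean(men) - mean(women)

# Reading 2: a statement about every individual pairing.
pairs = list(product(men, women))
share_man_taller = sum(m > w for m, w in pairs) / len(pairs)
every_man_taller = min(men) > max(women)

print("mean gap (cm):", mean_gap)
print("share of pairs where the man is taller:", share_man_taller)
print("every man taller than every woman:", every_man_taller)
```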

I want to live somewhere between pedantry and ignorance. We can give each other a break on the precision as long as the precise idea behind the words is mutually understood.


Dummyisation is different to stereotyping because:

  • stereotypes deny variability in the group being discussed
  • dummyisation acknowledges that it’s incorrect, before even starting
  • stereotyping relies on familiar categories or groupings like skin colour
  • dummyisation can be applied to any partitioning of a set, such as one based on height or even a grouping made at random

It’s the world of difference between taking on a hypothetical for the purpose of reaching a valid conclusion, and bludgeoning someone who doesn’t accept your version of the facts.

So this is a word I want to coin (unless a better one already exists—does it?):

  • dummyisation is assigning one value to a group or region
  • for convenience of the present discussion,
  • recognising fully that other groupings are possible
  • and that, in reality, not everyone from the group is alike.
  • Instead, we apply some ∞→1 function or operator on the truly variable, unknown, and variform distribution or manifold of reality, and talk about the results of that function.
  • We do this knowing it’s technically wrong, as a (hopefully productive) way of mulling over the facts from different viewpoints.

In other words, dummyisation is purposely doing something wrong for the sake of discussion.

Objectively the best recommendation of a statistics book ever.



There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, eg, with a post on … Ars Technica…

3. If … you think p-value thresholds should be a publishing criterion, you’ve got worse problems than reproducibility.

4. False negatives are errors, too. People already report “there was no association between X and Y” (or worse “there was no effect of X on Y”) in subgroups where the p-value is greater than 0.05. If you have the same data and decrease the false positives you have to increase the false negatives.

5. The problem isn’t the threshold so much as the really weak data in a lot of research, …. Larger sample sizes or better experimental designs would actually reduce the error rate; moving the threshold only swaps which kind of error you make.

7. And finally, why is it a disaster that a single study doesn’t always reach the correct answer? Why would any reasonable person expect it to? It’s not as if we have to ignore everything except the results of that one experiment in making any decisions.  
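Points 4 and 5 can be illustrated with a rough calculation. The sketch below assumes a deliberately simple setting that is not taken from the PNAS paper: a one-sided z-test with known unit variance, a hypothetical true effect of 0.3 standard deviations, and hypothetical sample sizes of 50 and 200.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1

def false_negative_rate(alpha, effect, n):
    """Chance of missing a real effect with a one-sided z-test.

    alpha:  p-value threshold (false-positive rate under the null)
    effect: true mean shift in standard-deviation units (hypothetical)
    n:      sample size
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is Normal(effect * sqrt(n), 1).
    return STD_NORMAL.cdf(z_crit - effect * n ** 0.5)

# Same data, stricter threshold: fewer false positives, more false negatives.
loose = false_negative_rate(alpha=0.05, effect=0.3, n=50)
strict = false_negative_rate(alpha=0.005, effect=0.3, n=50)

# Stricter threshold with four times the data: both error rates are low.
strict_big_n = false_negative_rate(alpha=0.005, effect=0.3, n=200)

print(loose, strict, strict_big_n)
```

Under these assumptions, tightening the threshold alone trades one error for the other, while the larger sample is what lowers the false-negative rate at the stricter threshold.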

HT @zentree