Posts tagged with statistics

Big, long cycle = trend.

## An analytic formula for the median.

Three observations get you there:

1. `min {a,b,c} = − max {−a, −b, −c}`
2. `second-from-top {a,b,c,d,e} = max ( {a,b,c,d,e} without max{a,b,c,d,e} )`
3. `max {a,b,c} ~ log_t (t^a + t^b + t^c ),   t→∞`

Putting these three together you can make a continuous formula approximating the median. Just subtract off the ends until you get to the middle.

It’s ugly. But, now you have a way to view the `sort` operation—which is discontinuous—in a “smooth” way, even if the smudging/blurring is totally fabricated. You can take derivatives, if that’s something you want to do. I see it as being like q-series: wriggling out from the strictures so the fixed becomes fluid.
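For three numbers, "subtract off the ends" can be read literally: median = a + b + c − max − min. Here is a minimal Python sketch under that reading, with the max replaced by the `log_t` approximation from observation 3 and the min obtained from observation 1. The function names and the default `t` are my own choices for illustration, not anything from the original formula.

```python
import math

def soft_max(xs, t=1e6):
    """Approximate max(xs) by log_t(sum(t**x)); it approaches max(xs) as t grows."""
    k = math.log(t)
    m = max(xs)  # shift used only for numerical stability; it cancels algebraically
    return m + math.log(sum(math.exp(k * (x - m)) for x in xs)) / k

def soft_min(xs, t=1e6):
    """Observation 1: min{a,b,c} = -max{-a,-b,-c}."""
    return -soft_max([-x for x in xs], t)

def soft_median3(a, b, c, t=1e6):
    """Smooth median of three numbers: the sum minus the two (soft) ends."""
    return a + b + c - soft_max([a, b, c], t) - soft_min([a, b, c], t)

print(soft_median3(1.0, 5.0, 3.0))
```

The approximation is loosest when two of the inputs tie, and tightens as `t` increases. Larger sets would need the second-from-top step of observation 2, which this sketch does not attempt.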

If the astronomical observations and other quantities on which the computation of orbits is based were absolutely correct, the elements also, whether deduced from three or four observations, would be strictly accurate (so far indeed as the motion is supposed to take place exactly according to the laws of Kepler), and, therefore, if other observations were used, they might be confirmed but not corrected.

But since all our measurements and observations are nothing more than approximations to the truth, the same must be true of all calculations resting upon them, and the highest aim of all computations made concerning concrete phenomena must be to approximate, as nearly as practicable, to the truth. But this can be accomplished in no other way than by a suitable combination of more observations than the number absolutely requisite for the determination of the unknown quantities. This problem can only be properly understood when an approximate knowledge of the orbit has been already attained, which is afterwards to be corrected so as to satisfy all the observations in the most accurate manner possible.

Johann Carl Friedrich Gauß, Theoria Motus Corporum Cœlestium in Sectionibus Conicis solem Ambientium, 1809

(translation by C.H. Davis, 1857; reprinted 1963)

(Source: cs.unc.edu)

1. As Dan Davies observed (from memory), “The Great Depression really happened; it wasn’t just an unusually inaccurate observation of an underlying 4% return on equities.”

2. Why do we assume errors have zero mean? …the mean of the residuals is not identifiable separately from the intercept, and we just choose the parametrization that has mean-zero residuals. In that situation it’s not an assumption and couldn’t be falsified empirically.
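The second point can be checked numerically. Below is a small sketch with made-up data (the true intercept 2.0, slope 0.5, the seed and the sample size are all arbitrary assumptions): the same noise is used twice, once with mean 0 and once shifted by 3, and a plain least-squares line is fitted each time.

```python
import random

def ols_fit(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(500)]
noise = [rng.gauss(0, 1) for _ in xs]

def simulate(error_mean):
    """Fit y = 2 + 0.5 x + error, where the error has the given mean."""
    ys = [2.0 + 0.5 * x + error_mean + e for x, e in zip(xs, noise)]
    b0, b1 = ols_fit(xs, ys)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, sum(resid) / len(resid)

b0_a, b1_a, rbar_a = simulate(0.0)
b0_b, b1_b, rbar_b = simulate(3.0)
print(b0_b - b0_a, b1_b - b1_a, rbar_a, rbar_b)
```

In both fits the residuals average to zero (up to rounding) and the slope is unchanged; the shift of 3 turns up entirely in the fitted intercept, which is why the data cannot tell the two stories apart.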

## Dummyisation

Statisticians are crystal clear on human variation. They know that not everyone is the same. When they speak about groups in general terms, they know that they are reducing N-dimensional reality to a 1-dimensional single parameter.

Nevertheless, statisticians permit, in their regression models, variables that take on only a few discrete values, such as `{0,1}` for `male/female` or `{a,b,c,d}` for `married/never-married/divorced/widowed`.

No one doing this believes that all such people are the same. And anyone who’s done the least bit of data cleaning knows that there will be `NA`'s, wrongly coded cases, mistaken observations, ill-defined measures, and aberrances of other kinds. It can still be convenient to use binary or n-ary dummies to speak simply. Maybe the marriages of some people coded as `currently married` are on the rocks, and therefore they are more like `divorced`—or like a new category of people in the midst of watching their lives fall apart. Yes, we know. But what are you going to do—ask respondents to rate their marriage on a scale of one to ten? That would introduce false precision and model error, and might put respondents in such a strange mood that they answer other questions strangely. Better to just live with being wrong. Any statistician who uses the `cut` function in R knows that the variable didn’t become basketed←continuous in reality. But a `facet_wrap` plot is easier to interpret than a 3D wireframe or cloud-points plot.
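As a rough illustration of what basketing does, here is a Python analogue of R's `cut`. The function, the ages, the break points and the labels are all invented for this sketch; note also that R's `cut` closes intervals on the right by default, whereas this version closes them on the left.

```python
from bisect import bisect_right

def cut(values, breaks, labels):
    """Assign each value to the half-open interval [breaks[i], breaks[i+1])."""
    if len(labels) != len(breaks) - 1:
        raise ValueError("need exactly one label per interval")
    out = []
    for v in values:
        i = bisect_right(breaks, v) - 1
        # Values outside the breaks get None, loosely mirroring R's NA.
        out.append(labels[i] if 0 <= i < len(labels) else None)
    return out

ages = [23, 37.5, 64, 41, 31, 49]          # hypothetical continuous variable
groups = cut(ages, [0, 30, 50, 120], ["young", "middle", "older"])
print(groups)
```

Ages 31 and 49 end up in the same basket even though they are eighteen years apart: that is exactly the information one agrees to throw away in exchange for a plot or a coefficient that is easier to talk about.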

To the precise mind, there’s a world of difference between saying

• "the mean height of men > the mean height of women", and saying
• "men are taller than women".

Of course one can interpret the second statement to be just a vaguer, simpler inflection of the first. But some people understand statements like the second to mean “each man is taller than each woman”. Or, perniciously, they take “Blacks have lower IQ than Whites” to mean “every Black is mentally inferior to every White.”
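To put a number on the gap between the two readings, here is a small sketch. The means and standard deviations are hypothetical round figures chosen for illustration, not survey data, and both distributions are assumed normal and independent.

```python
from statistics import NormalDist

# Hypothetical height distributions in cm (illustrative, not measured).
men = NormalDist(mu=176, sigma=7)
women = NormalDist(mu=163, sigma=6.5)

# The difference of two independent normals is again normal.
diff = men - women

# Probability that a randomly chosen man is taller than a randomly chosen woman.
p_taller = 1 - diff.cdf(0)
print(round(p_taller, 3))
```

Under these assumed numbers the statement about means is true by a wide margin, yet the probability comes out a bit over 0.9 rather than 1: roughly one random pairing in ten has the woman taller, so "each man is taller than each woman" is simply false.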

I want to live somewhere between pedantry and ignorance. We can give each other a break on the precision as long as the precise idea behind the words is mutually understood.

Dummyisation is different to stereotyping because:

• stereotypes deny variability in the group being discussed
• dummyisation acknowledges that it’s incorrect, before even starting
• stereotyping relies on familiar categories or groupings like skin colour
• dummyisation can be applied to any partitioning of a set, such as one based on height, or even a grouping made at random

It’s the world of difference between taking on a hypothetical for the purpose of reaching a valid conclusion, and bludgeoning someone who doesn’t accept your version of the facts.

So this is a word I want to coin (unless a better one already exists—does it?):

• dummyisation is assigning one value to a group or region
• for convenience of the present discussion,
• recognising fully that other groupings are possible
• and that, in reality, not everyone from the group is alike.
• Instead, we apply some ∞→1 function or operator on the truly variable, unknown, and variform distribution or manifold of reality, and talk about the results of that function.
• We do this knowing it’s technically wrong, as a (hopefully productive) way of mulling over the facts from different viewpoints.

In other words, dummyisation is purposely doing something wrong for the sake of discussion.

There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, eg, with a post on … Ars Technica…

3. If … you think p-value thresholds should be a publishing criterion, you’ve got worse problems than reproducibility.

4. False negatives are errors, too. People already report “there was no association between X and Y” (or worse “there was no effect of X on Y”) in subgroups where the p-value is greater than 0.05. If you have the same data and decrease the false positives you have to increase the false negatives.

5. The problem isn’t the threshold so much as the really weak data in a lot of research, …. Larger sample sizes or better experimental designs would actually reduce the error rate; moving the threshold only swaps which kind of error you make.

7. And finally, why is it a disaster that a single study doesn’t always reach the correct answer? Why would any reasonable person expect it to? It’s not as if we have to ignore everything except the results of that one experiment in making any decisions.

HT @zentree
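Points 4 and 5 can be sketched with a textbook power calculation. This is a minimal example for a two-sided one-sample z-test; the effect size (0.3 standard deviations) and the sample sizes (50 and 200) are assumptions picked for illustration, not figures from the PNAS paper or the quoted post.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power_two_sided(effect, sd, n, alpha):
    """Power of a two-sided one-sample z-test when the true mean shift is `effect`."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sd
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)

for n in (50, 200):
    for alpha in (0.05, 0.005):
        power = power_two_sided(effect=0.3, sd=1.0, n=n, alpha=alpha)
        print(f"n={n:4d} alpha={alpha:<6} power={power:.3f} false-negative rate={1 - power:.3f}")
```

With the sample size held fixed, tightening the threshold from 0.05 to 0.005 lowers the power, which is the swap of one error for another. Raising the sample size is what lets the stricter threshold keep a low false-negative rate as well.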

Over a year ago, I wrote a letter to the editor of the Journal of Computational Science, urging the retraction of Bollen, Mao, and Zeng’s paper, “Twitter Mood Predicts the Stock Market.” Since JoCS is an Elsevier journal, one does not simply email the editor.

Rather, one has to register with the Elsevier author system, … submit `LaTeX` source code of a letter, along with supporting documents, author bio, … I distilled the main arguments into two:

1. first, that the Granger causality tests presented in BMZ’s paper are … datamining, and present no evidence for a connection between Twitter and the Dow Jones Index;
2. and that the quoted predictive accuracy of the forecast model is so high, it would … [contradict] the experiences of … [traders] … and so this forecast accuracy is likely to be erroneously reported.

I included references to BMZ’s failed attempts to commercialize their patented techniques with Derwent.
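The datamining worry in the first argument is the usual multiple-testing one. As a generic sketch (the numbers of tests below are hypothetical, and the tests are assumed independent, which real Granger tests across lags would not quite be), this is how quickly a "significant" result becomes likely on pure noise:

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 10, 50):  # hypothetical numbers of tests
    print(k, round(familywise_error_rate(0.05, k), 3))
```

Run enough tests at the 0.05 level and some of them will clear the bar with no real connection in the data, so a handful of significant lags is weak evidence on its own.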

Following the strictest protocol, the editor of JoCS duly sent this letter to reviewers. After roughly seven months, …

The reviewers’ comments were more than fair. If my arguments were unclear, I was more than happy to reword them and provide additional evidence to get my point across. So I edited my letter to the editor, and re-sent it. …

…within two months or so (the equivalent of overnight in journal-time), the editor sent me a rejection notice with … review, quoted below. This review—this review is sensational. As one afflicted with Hamlet Syndrome, I admire Reviewer #4’s conviction. As someone too often in search of the right phrase to dismiss a crap idea, I take delight in Reviewer #4’s acid pen: I have never seen a reviewer so viciously shit-can a paper before. Reviewer #4 tore my letter to pieces, then burned the pieces. Then poured lye on the ashes. Then salted the earth where the lye sizzled. Then burnt down the surrounding forest, etc.

Fun Coursera course on virology.

• Viruses are so numerous (10³⁰) and so ubiquitous that they give this Boltzmann flavour of “enough stuff” to really do statistics on.

• Viruses are just a bundle of `{proteins, lipids, nucleic acids}` with a shell. It’s totally value-free, no social Darwinism or “survival of the fittest” being imbued with a moral colour. Just a thing that happened that can replicate.
• Maybe it’s just because I was reading about nuclear spaces (⊂ topological vector spaces) and white-noise processes that I think of this. Viruses have a qualitatively different error structure than Gaussian. Instead of white noise, it’s about whether they can get past certain barriers, like:
• survive out in the air/water/cyanide
• bind to a DNA