Posts tagged with machine learning

Michael Conover: Information Visualization for Large-Scale Data Workflows

  • data geometry
  • memes
  • visual analysis of program structure
  • visual analysis of propaganda
  • image
  • compare last week’s analysis and share with colleagues
  • geom_bin2d rather than geom_point(alpha=...) in ggplot2
  • ggpairs
  • automated grading: in addition to unit testing, 1) parse syntax trees of submissions, 2) define edit distance between them, 3) induce a network structure, 4) identify clusters, 5) give feedback to a representative member of the cluster and cc: everyone else
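
The five automated-grading steps in the last bullet can be sketched roughly as follows. This is only an illustration under loud assumptions: the submissions are made up, the "edit distance" is a cheap stand-in (1 minus difflib's similarity ratio over flattened node types, not a true tree edit distance), the 0.2 threshold is arbitrary, and clusters are simply connected components of the resulting network.

```python
import ast
import difflib
from itertools import combinations


def node_sequence(source):
    """Step 1: parse a submission and flatten its syntax tree to node-type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def distance(a, b):
    """Step 2 (stand-in): 1 minus difflib's similarity ratio, not a tree edit distance."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def cluster(submissions, threshold=0.2):
    """Steps 3 and 4: link submissions closer than `threshold`, return components."""
    seqs = {name: node_sequence(src) for name, src in submissions.items()}
    neighbours = {name: set() for name in submissions}
    for a, b in combinations(submissions, 2):
        if distance(seqs[a], seqs[b]) < threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in submissions:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            name = stack.pop()
            if name not in component:
                component.add(name)
                stack.extend(neighbours[name] - component)
        seen |= component
        clusters.append(component)
    return clusters


def representative(component, submissions):
    """Step 5: the member with the smallest total distance to the rest (a medoid)."""
    seqs = {name: node_sequence(submissions[name]) for name in component}
    return min(sorted(component),
               key=lambda n: sum(distance(seqs[n], seqs[m]) for m in component))


submissions = {  # hypothetical student answers
    "ann": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "bob": "def total(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n",
    "cat": "def total(xs):\n    return sum(xs)\n",
}
clusters = cluster(submissions)
```

Because only node types are compared, ann and bob (same loop, different variable names) land in one cluster and cat's one-liner in another; feedback written for the representative would then be cc'd to the rest of its cluster.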

Tyler Cowen says that super hackers will benefit from improving computer technology and reap the high wages of the post-recession economy.

I’m sorry to say I too have used the lazy robo-programmers metaphor. That was uncareful non-thinking on my part.

Trying to be more logical, what should we really conclude from the assumption that observed ↑ growth in “computer stuff” will continue apace?


Although partial least squares regression was not designed for classification and discrimination, it is … used for these purposes. For example, PLS has been used to:

  • distinguish Alzheimer’s, senile dementia of the Alzheimer’s type, and vascular dementia
  • discriminate between Arabica and Robusta coffee beans
  • classify waste water pollution
  • separate active and inactive compounds in a quantitative structure-activity relationship study
  • differentiate two types of hard red wheat using near-infrared analysis
  • distinguish transsexualism, borderline personality disorder and controls using a standard instrument
  • determine the year of vintage port wine
  • classify soy sauce by geographic region
  • determine emission sources in ambient aerosol studies
Matthew Barker and William Rayens


Simulated Annealing

Just some things that have already been attempted with statistical text-mining — from politics to Latin.

Which pair is more different?

  • keyboard | keyb`ard
  • keyboard | keybpard
  • keyboard | keebored

Of course in mathematics we get to decide among many definitions of size and there is no “correct” answer. Just what suits the application.

I can think of two approaches to defining distance measures between words:

  • sound-based — d(Hirzbruch, Hierzebrush) < d(Hirzbruch, Hirabruc)
  • keyboard-based — d(u,y) < d(u,o)
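
A keyboard-based distance could be as simple as the physical distance between two keys. Here is a minimal sketch; the row offsets are rough guesses at QWERTY geometry, not measurements, and `KEY_POS` and `key_distance` are names made up for the illustration.

```python
import math

# Rough QWERTY geometry: each row is shifted right by a fraction of a key.
# The offsets are illustrative assumptions, not measurements.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl;", 0.25), ("zxcvbnm,./", 0.75)]
KEY_POS = {
    key: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, key in enumerate(keys)
}


def key_distance(a, b):
    """Euclidean distance between two keys, in units of one key width."""
    (xa, ya), (xb, yb) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return math.hypot(xa - xb, ya - yb)
```

With this layout `key_distance("u", "y")` is one key width and `key_distance("u", "o")` is two, which matches d(u,y) < d(u,o).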

Reading on online fora (including YCombinator, tisk tisk) the only distance functions I hear about are the ones with Wikipedia pages: Hamming distance and Levenshtein distance.

These are defined in terms of how many word-processing operations are required to correct a mis-typed word.

  • How many letters do I need to insert?
  • How many letters do I need to delete?
  • How many letter-pairs do I need to swap?
  • How many vim keystrokes do I need?

and so on—those kinds of ideas.
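
For reference, here is a small sketch of those textbook distances. Hamming counts mismatched positions in equal-length strings; Levenshtein counts insertions, deletions and substitutions; the optional `transpositions` flag gives the "optimal string alignment" variant, which also counts a swap of two adjacent letters as one edit. The function names are my own.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True, the optimal string
    alignment variant that also counts an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete everything in a[:i]
    for j in range(cols):
        d[0][j] = j          # insert everything in b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

On the three pairs from the top of the post, both Hamming and Levenshtein give 1 for keyb`ard and keybpard and 3 for keebored. By these measures the sound-alike spelling is the most different of the three, which is the gap a sound-based distance would fill.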

inter-letter interaction effects

If we could get conditional probabilities of various kinds of errors — like

  • Am I more likely to mis-type ous while writing varoius, precarious, or imperious? There could be some kind of finger- or hand-based reason, like if I’ve just been using right-handed fingers near my ous fingers, or that I have to angle my hand weirdly in order to hit the previous couple strokes in some other word?
  • Am i more likely to mis-type reflexive as reflexible when the document topic is gymnastics?
  • Am i more likely to make a typo in google if I’m typing fast?
  • What if you can catch me mis-placing my hand on the homerow/ how dp upi apwaus fomd tjos crazu stiff? That’s almost like just one error. (It’s certainly less distance from the real sentence than a random string of characters of equal length.)
  • Or if I click the mouse in the wrong place before correcting my spelling? d(Norschwanstein, Ndorschwanstein) or d(rehabilitation, rehabitatiilon)
  • Am i more likely to isnert a common cliche rather than what i actually mean after a word that begins a common cliche/
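
One way such error probabilities could enter a distance is as substitution costs in an edit distance. The sketch below is a guess at the shape of the idea, not a fitted model: the key geometry, the 1.5-key "neighbour" radius and the 0.3 cost are all invented, and in a real version the cost might be something like the negative log of an estimated P(typed | intended).

```python
import math

# Rough QWERTY geometry (illustrative offsets, not measurements).
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl;", 0.25), ("zxcvbnm,./", 0.75)]
KEY_POS = {key: (col + off, row)
           for row, (keys, off) in enumerate(ROWS)
           for col, key in enumerate(keys)}


def substitution_cost(intended, typed):
    """Hypothetical cost: cheap for neighbouring keys, 1.0 for everything else."""
    if intended == typed:
        return 0.0
    if intended in KEY_POS and typed in KEY_POS:
        (x1, y1), (x2, y2) = KEY_POS[intended], KEY_POS[typed]
        if math.hypot(x1 - x2, y1 - y2) <= 1.5:
            return 0.3
    return 1.0


def weighted_distance(intended, typed):
    """Edit distance with unit insert/delete cost and the substitution cost above."""
    prev = [float(j) for j in range(len(typed) + 1)]
    for i, a in enumerate(intended, start=1):
        curr = [float(i)]
        for j, b in enumerate(typed, start=1):
            curr.append(min(prev[j] + 1.0,                           # deletion
                            curr[j - 1] + 1.0,                       # insertion
                            prev[j - 1] + substitution_cost(a, b)))  # substitution
        prev = curr
    return prev[-1]


intended = "how do you always find this crazy stuff"
shifted = "how dp upi apwaus fomd tjos crazu stiff"    # the mis-placed-hand sentence
scrambled = "xqz vk wjb cmgqeh rrzt plew uhhna bdkwy"  # arbitrary string, same length
```

Under these made-up costs the shifted-hand sentence comes out much closer to the intended one than the arbitrary string does, because every one of its errors lands on a neighbouring key.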

A Bit Of Forensics

EDIT: Once I got about halfway throguh this article, I stopped correcting my typoes, so you can see the kind that I make. I was typing on a flat keyboard, asymmetrically holding a smallish non-Mac laptop (bigger than an Eee) with my elbows out, head down — except when I type fast and interchange letters, with perfect posture, “playing the piano” with my ten finger muscles rather than moving my wrists — at an ergonomic keyboard with a broken M. I actually don’t recall which way i wrote this article. I may hav eeven written it in shifts.

Here are some nice ones as well. Look at the comments section. By the posting times (and text) you can see that the debate was feverish—no time for corrections and the correspondents were steamed up emotionally. Their typoes really have personalities—for example Kien makes a lot of errors with his right middle finger moving up. (did → dud, is → us, promoted → promotied, inquisition → iquisition, mean → meaqn, Church → Chruch, because → becuase, Copernican → Ceprican, your → you, clearly → cleary) but also some errors of spelling with no sound-distance (Pythagoras → Pythagorus) and uses both the sounds disingenious and disingenuous. Letter-switching, ilke I do, is common; a few fat-fingers (meaqn) or forgotten letters, but this iou stuff seems unusual and possibly characteristic of something.

Other participants make different sorts of errors, or at least with different frequencies (they’re relatively more likely to omit or switch letters than to use the wrong letter, for example). But let’s just focus on Ken because so many errors of the typoes are localised to that right middle finger. I wonder if Ken has a problem with that finger? Or maybe his keyboard is shaped in such a way that it’s difficult to correctly strike those keys specifically? (Maybe certain ergonomic keyboards would fit this — or an Eee Pc with the elbows out and “pigeon-toed” hands. But why would the errors then be localised to the right middle finger? It’s more mobile than pinky & ring fingers and we’re not taught to stick it to the homerow like the index finger.) I rule out the theory that his right hand hovers above the keyboard rather than sitting on the homerow because then he should make similar errors with yuiop and maybe bnm,.hjkl; as well. Also, notice that he doesn’t make comparable errors with ewr as with iou. How do we know he sits symmetrically? I have a tough time deciphering why there are more errors with that finger on a first read-through.

We could find more of Ken’s writing here and see how he types when he’s less agitated. I bet there are no Ceprican's there but Pythagorus would still be. As for Chruch? Hmmm. Don’t know.
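
A first pass at this kind of forensics could be a tally of which fingers own the letters involved in each error. The sketch below uses only the pairs quoted above and assumes a standard touch-typing finger chart (which may well not describe how the writer actually types). The alignment comes from difflib, so swapped letters show up as an insertion plus a deletion rather than as one swap.

```python
import difflib
from collections import Counter

# Standard touch-typing finger chart (an assumption about the typist).
FINGERS = {
    "left pinky": "qaz", "left ring": "wsx", "left middle": "edc",
    "left index": "rfvtgb", "right index": "yhnujm", "right middle": "ik,",
    "right ring": "ol.", "right pinky": "p;/",
}
FINGER_OF = {key: finger for finger, keys in FINGERS.items() for key in keys}

# (intended, typed) pairs quoted in the text above.
TYPOS = [("did", "dud"), ("is", "us"), ("promoted", "promotied"),
         ("inquisition", "iquisition"), ("mean", "meaqn"), ("Church", "Chruch"),
         ("because", "becuase"), ("Copernican", "Ceprican"), ("your", "you"),
         ("clearly", "cleary")]


def letters_in_error(intended, typed):
    """Letters that difflib marks as replaced, inserted, or deleted."""
    intended, typed = intended.lower(), typed.lower()
    involved = []
    matcher = difflib.SequenceMatcher(None, intended, typed)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            involved.extend(intended[i1:i2] + typed[j1:j2])
    return involved


finger_counts = Counter(
    FINGER_OF[letter]
    for intended, typed in TYPOS
    for letter in letters_in_error(intended, typed)
    if letter in FINGER_OF
)
print(finger_counts.most_common())
```

Ten pairs is far too few to conclude anything, and raw counts would need to be compared against how often each finger is used in ordinary English before calling any finger unusual.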

Big Data vs Models

Now the big-data-ists (the other half of Leo Breiman’s partition of statistical modellers -vs- data miners) would probably say “Google has a jillion search results including measurements of people correcting themselves and including time series of the letters people type — so just throw some naive Bayes at that pile and watch it come to the correct answer!” Maybe they’re right.
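
The crudest version of that idea is plain Bayes' rule over whole words, sketched below. The correction log is invented for illustration; a real naive Bayes model would break the typed string into per-character or per-keystroke features rather than treating the whole word as a single feature.

```python
from collections import Counter, defaultdict

# Hypothetical log of (what was typed, what the person corrected it to).
CORRECTIONS = [
    ("teh", "the"), ("teh", "the"), ("teh", "ten"),
    ("becuase", "because"), ("chruch", "church"), ("the", "the"),
]

prior = Counter(intended for _, intended in CORRECTIONS)  # counts for P(intended)
likelihood = defaultdict(Counter)                         # counts for P(typed | intended)
for typed, intended in CORRECTIONS:
    likelihood[intended][typed] += 1


def correct(typed):
    """Return the intended word maximising P(intended) * P(typed | intended)."""
    def score(intended):
        p_intended = prior[intended] / len(CORRECTIONS)
        p_typed = likelihood[intended][typed] / prior[intended]
        return p_intended * p_typed

    best = max(prior, key=score)
    # Strings never seen in the log score zero everywhere: leave them alone.
    return best if score(best) > 0 else typed
```

In this toy log "teh" was corrected to "the" twice and to "ten" once, so `correct("teh")` picks "the"; anything the log has never seen is returned unchanged.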

If someone wants to mess around with this stuff with me — leave me a comment. We could grab tweets and analyse typoes within differnet text-…[by which tool] was used to send the tweet. For example the Twitter website means it was keyboard-typed, certain mobile devices have Swype, other errors we might be able to guess tha tis …[that it’s] a T9 mobile keyboard.

  • Could we tell if a person is left-handed by their keyboard mistkaes?
  • Could we guess their education level/
  • Could we tell what tweeting platform they used by their errors rather than by 
  • Could we tell where they’re from? Or any other stalky information that advertisers/HR want to know but web browsers want to hide about themselves? (Say goodbye to mandatory drug testing in the workplace, say hello to your boss getting an email when a statistics company that monitors your twitter feed guesses you smoked pot last night based on the spelling and timing of your Facebook posts.)

I feel vindicated by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms, in several ways, what I’ve been saying about recommendations as well as my experience designing recommendation engines for clients:

  1. Fancy ML techniques don’t matter so much. The winning BellKor/Pragmatic Chaos teams implemented ensemble methods with something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBM's and SVD.

    If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD and two hour-long Google Tech Talks by Geoff Hinton on RBM’s. RBM’s are basically a superior subset of neural networks, with a theoretical basis for why they’re superior. SVD is a dimension reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)

    Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.

  2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can OO program these characteristics. And (see bottom) doing so seems to improve the ML result a lot.

    Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).

  3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s reference to a ten-dimensional library where neighbourhood relationships don’t just go along a Dewey Decimal line but also style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

    For them the preferences evolve fairly quickly over time, which has to make it hard. If your users’ preferences evolve over time too: good luck.

    John Wilder Tukey: "To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again." via John D Cook 
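
To give a feel for how small the "SVD" half of point 1 is, here is a toy sketch. It is matrix factorisation trained by stochastic gradient descent, the flavour that went by the name "SVD" (sometimes "Funk SVD") around the Netflix Prize, and not the exact decomposition from a linear algebra lecture. The ratings, the rank and the learning settings are all made up, and none of this is claimed to be what Netflix runs.

```python
import random

# Hypothetical (user, movie, stars) triples; most user/movie pairs are unrated.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]
N_USERS, N_MOVIES, RANK = 4, 4, 2


def predict(users, movies, u, m):
    """Predicted rating: dot product of a user vector and a movie vector."""
    return sum(p * q for p, q in zip(users[u], movies[m]))


def rmse(users, movies):
    """Root-mean-square error over the known ratings only."""
    squared = [(r - predict(users, movies, u, m)) ** 2 for u, m, r in RATINGS]
    return (sum(squared) / len(squared)) ** 0.5


def init_factors(seed=0):
    """Small random starting vectors, seeded so the run is repeatable."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.1, 0.5) for _ in range(RANK)] for _ in range(N_USERS)]
    movies = [[rng.uniform(0.1, 0.5) for _ in range(RANK)] for _ in range(N_MOVIES)]
    return users, movies


def train(users, movies, epochs=500, rate=0.02, penalty=0.02):
    """Stochastic gradient descent on squared error with an L2 penalty."""
    for _ in range(epochs):
        for u, m, r in RATINGS:
            err = r - predict(users, movies, u, m)
            for k in range(RANK):
                p, q = users[u][k], movies[m][k]
                users[u][k] += rate * (err * q - penalty * p)
                movies[m][k] += rate * (err * p - penalty * q)


users, movies = init_factors()
rmse_before = rmse(users, movies)
train(users, movies)
rmse_after = rmse(users, movies)
```

After training, `predict(users, movies, u, m)` also gives a guess for pairs that were never rated, which is the recommendation. Everything in points 2 and 3 (what the columns mean, what gets measured) sits outside these forty lines.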

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance (1−AUC).

* I admire @antgoldbloom for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or even worse but still typical: “I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.

visualisation of how the kernel trick makes a non-separable collection of points linearly separable.

I guess the kernel mappings really add a dimension, rather than replacing a dimension, don’t they.
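
A tiny sketch of the "adds a dimension" reading, using one particular map chosen for illustration (other kernels correspond to quite different feature spaces). Two rings of points that no straight line separates in 2-D are lifted with (x, y) → (x, y, x² + y²), after which a flat plane does the job. The data, the radii and the threshold of 4 are all made up.

```python
import math
import random

rng = random.Random(0)


def ring(radius, n):
    """n points scattered near a circle of the given radius."""
    points = []
    for _ in range(n):
        angle = rng.uniform(0, 2 * math.pi)
        r = radius + rng.uniform(-0.2, 0.2)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


inner = ring(1.0, 50)   # class A: small circle around the origin
outer = ring(3.0, 50)   # class B: larger ring surrounding class A


def lift(point):
    """Explicit feature map (x, y) -> (x, y, x^2 + y^2): keeps x and y, adds z."""
    x, y = point
    return (x, y, x * x + y * y)


def kernel(p, q):
    """Inner product of lift(p) and lift(q), computed from the 2-D points alone."""
    (x1, y1), (x2, y2) = p, q
    return x1 * x2 + y1 * y2 + (x1 * x1 + y1 * y1) * (x2 * x2 + y2 * y2)


# In 3-D the plane z = 4 separates the two classes (threshold picked by eye).
separable = (all(lift(p)[2] < 4 for p in inner)
             and all(lift(p)[2] > 4 for p in outer))
```

The `kernel` function is the "trick" part: it returns the same number as the dot product of the two lifted points without ever building them, so a method that only needs inner products never has to visit the extra dimension explicitly.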

Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:

Assume that the data are generated by the following model…

followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.

In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy….. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important:

  • Rashomon: the multiplicity of good models;
  • Occam: the conflict between simplicity and accuracy;
  • Bellman: dimensionality — blessing or curse

Leo Breiman, Statistical Modeling: The Two Cultures (2001)

(which are: machine learning / artificial intelligence / algorithmists —vs— model builders / statistics / econometrics / psychometrics)