I feel vindicated by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms, in several ways, both what I’ve been saying about recommendations and my experience designing recommendation engines for clients:
- Fancy ML techniques don’t matter so much. The winning BellKor’s Pragmatic Chaos team implemented an ensemble of something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBMs and SVD.
If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD, and two hour-long Google Tech Talks by Geoff Hinton on RBMs. RBMs are basically a restricted kind of neural network, with a theoretical basis for why the restriction helps. SVD is a dimension-reduction technique from linear algebra; a toy sketch follows below. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)
Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.
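Since SVD carries so much of the weight here, a minimal numpy sketch of the idea. The ratings matrix is invented, and a real engine would handle missing ratings more carefully (e.g. with gradient-based factorisation) rather than treating them as zeros:

```python
# Toy SVD-based recommendation: approximate the ratings matrix at low
# rank and read predictions out of the reconstruction.
import numpy as np

# rows = users, columns = movies, entries = 1-5 stars (0 = unrated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep only the two dominant "taste" dimensions
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The zero (unrated) cells of R now hold predicted scores in R_hat.
print(np.round(R_hat, 2))
```

The whole trick is that k is much smaller than the number of movies: tastes get compressed onto a few latent dimensions, and predictions fall out of the compression.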
- Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can encode those characteristics in an OO program. And (see bottom) doing so seems to improve the ML result a lot.
Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers wrote a sentiment-analysis paper on exactly this topic: all of the text-analysis papers they surveyed chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. The Googlers achieved a superior solution to that particular problem without fancy new maths, using only common sense and exploration specific to their chosen domain (restaurant reviews); a toy sketch of the idea follows.
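A hypothetical sketch of what “domain knowledge as features” can look like in practice. The keyword lists, features, and classifier are invented for illustration, not taken from the Google paper:

```python
# Hand-picked features that anyone who reads restaurant reviews would
# recognise, fed to an otherwise plain classifier.
import re
from sklearn.linear_model import LogisticRegression

SERVICE_WORDS = {"waiter", "rude", "slow", "attentive"}
FOOD_WORDS = {"bland", "delicious", "stale", "fresh"}

def domain_features(review):
    words = re.findall(r"[a-z']+", review.lower())
    return [
        sum(w in SERVICE_WORDS for w in words),  # service mentions
        sum(w in FOOD_WORDS for w in words),     # food-quality mentions
        review.count("!"),                       # emphatic punctuation
    ]

reviews = ["The waiter was rude and the food bland.",
           "Fresh, delicious food and an attentive waiter!"]
labels = [0, 1]  # 0 = negative, 1 = positive
clf = LogisticRegression().fit([domain_features(r) for r in reviews], labels)
```

Nothing here is clever; the value is entirely in knowing which words matter in this domain.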
- What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s image of a ten-dimensional library, where neighbourhood relationships don’t just run along a Dewey Decimal line but also along style, mood, season, director, actors, cinematography, and yes the “people like you” metric (“collaborative filtering”, a spangled bit of jargon). A toy sketch of that sort of neighbourhood similarity follows below.
For Netflix, those preferences also evolve fairly quickly over time, which has to make the problem harder. If your users’ preferences drift over time too: good luck.
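To make the ten-dimensional-library image concrete, a toy neighbourhood computation. The films and the feature axes are invented; the point is only that similarity runs along many dimensions at once:

```python
# Items live at points in a feature space (dark, romantic, tense, ...);
# "neighbourhood" is cosine similarity, not position on a single shelf.
import numpy as np

movies = {                    #  dark  romantic  tense
    "Alien":        np.array([0.9,  0.1,      0.8]),
    "Blade Runner": np.array([0.8,  0.3,      0.7]),
    "Amelie":       np.array([0.1,  0.9,      0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = movies["Alien"]
for title in sorted(movies, key=lambda m: -cosine(query, movies[m])):
    print(title, round(cosine(query, movies[title]), 3))
```

Swap in real axes (director, season, cinematography, a collaborative-filtering score) and the same neighbourhood logic carries over.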
John Wilder Tukey: “To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of ‘Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!’ is one we must all fight against again and again, and yet again.” (via John D Cook)
Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation is that the winners usually win by making gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration, and that gives them a significant boost in performance; a toy illustration follows.
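A hypothetical illustration of limiting the search with an assumption about the generating process: here the modeller asserts up front that the target can only increase with the first feature. The data and the constraint are invented:

```python
# Common-sense constraint baked into the model: the target is
# monotonically non-decreasing in feature 0; no claim about feature 1.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
size = rng.uniform(20, 200, 500)           # e.g. flat size in m^2
junk = rng.normal(size=500)                # an irrelevant feature
price = 3 * size + rng.normal(0, 10, 500)  # the generating process

X = np.column_stack([size, junk])
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, price)
```

The constraint shrinks the hypothesis space before any data is seen, which is exactly the kind of boost he was describing.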
* I admire @antgoldbloom for following through on his idea, and I do think Kaggle has a positive impact on the world. That’s much better than the typical “Someone should make X, that would be a great business”, or, even worse but still typical, “I’ve been saying they should have that!” Still, I hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.