In last week’s WSJ column (subscription only, I’m afraid) I wrote about how Bayesian Filters — derived from the theories of an 18th century vicar called Thomas Bayes and used to filter out spam — could also be used to sift through other kinds of data. Here’s a preliminary list of some of the uses I came across:
- Deconstructing Sundance: how a bunch of guys at UnSpam Technologies successfully predicted the winners (or at least who would be among the winners) at this year’s festival using POPFile, the Bayesian filter of choice;
- ShopZilla, a “leading shopping search engine,” uses POPFile “in collaboration with Kana to filter customer emails into different buckets so we can apply the appropriate quality of service and have the right people to answer to the emails. Fortunately, some of the buckets can receive satisfactory canned responses. The bottom line is that PopFile provides us with a way to send better customer responses while saving time and money.”
- Indeed, even non-spam email can benefit from Bayes: filtering boring from non-boring email, say, or personal from work. Jon Udell experimented with this kind of thing a few years ago.
- So can virus and malware detection. Here’s a post on the work by Martin Overton in keeping out the bad stuff simply using a Bayesian Filter. Here’s Martin’s actual paper (PDF only). (Martin has commented that he actually has two blogs addressing his work in this field, here and here.)
- John Graham-Cumming, author of POPFile, says he’s been approached by people who would like to use it in regulatory fields, in computational biology and on dating websites (“training a filter for learning your preferences for your ideal wife,” as he puts it). He also says he’s been considering feeding in articles from the WSJ and The Economist in an attempt to find a way to predict weekly stock market prices. “If we do find it out,” he says, “we won’t tell you for a few years.” So he’s probably already doing it.
If you’re new to Bayes, I hope this doesn’t put you off. All you have to do is show it what to do and then leave it alone. If you haven’t tried POPFile and you’re having spam issues, give it a try. It’s free, easy to install and will probably be the smartest bit of software on your computer.
I suppose the way I see it is that Bayesian filters don’t care about how words look, what language they’re in, what they mean, or even whether they are words at all. They look at how the words behave. So while the UnSpam guys found that the word “riveting” was much more likely to be used by a reviewer to describe a dud movie than a good one, the Bayesian Filter isn’t going to care that this seems somewhat contradictory. In real life we would have been fooled, because we know “riveting” is a good thing (unless it’s some weird wedgie-style torture involving jeans that I haven’t come across). Bayes doesn’t know that. It just knows that the word has an unhealthy habit of cropping up in reviews of movies that bomb.
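For the curious, here is a rough sketch in Python of what "looking at how words behave" can mean in practice. It is a toy naive Bayes classifier, not POPFile's actual code, and everything in it is assumed for illustration: the `TinyBayes` class, the "flop" and "hit" buckets, and the training sentences, which are invented rather than taken from real reviews.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes text classifier (illustrative only, not POPFile)."""

    def __init__(self):
        self.word_counts = {}        # bucket name -> Counter of words seen
        self.doc_counts = Counter()  # bucket name -> number of training texts

    def train(self, bucket, text):
        # The filter only counts tokens; it never looks at what they mean.
        words = text.lower().split()
        self.word_counts.setdefault(bucket, Counter()).update(words)
        self.doc_counts[bucket] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())

        scores = {}
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from how common the bucket is overall (the prior).
            score = math.log(self.doc_counts[bucket] / total_docs)
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out a bucket.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[bucket] = score
        return max(scores, key=scores.get)


# Invented training data: "riveting" happens to show up in the flops.
filt = TinyBayes()
filt.train("flop", "a riveting spectacle with riveting effects")
filt.train("flop", "riveting but empty")
filt.train("hit", "quiet honest and moving")
filt.train("hit", "a moving story told with honest performances")

print(filt.classify("a riveting thriller"))
print(filt.classify("honest and moving"))
```

With this made-up data the first review lands in the "flop" bucket and the second in "hit", purely because of which bucket each word has kept company with. Swap the labels in the training lines and the filter will happily conclude the opposite.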
In short, Bayesian Filters watch what words do, or what the email is using the words to do, rather than looking at the meaning of the words. We should be applying this to the speeches of politicians, CEOs and PR types to see what comes out. Is there any way of measuring how successful a politician is going to be based on their early speeches? What about press releases? Any way of predicting the success of the products they tout?