How to Make More Use of the Vicar

In last week’s WSJ column (subscription only, I’m afraid) I wrote about how Bayesian Filters — derived from the theories of an 18th century vicar called Thomas Bayes and used to filter out spam — could also be used to sift through other kinds of data. Here’s a preliminary list of some of the uses I came across:

  • Deconstructing Sundance: how a bunch of guys at UnSpam Technologies successfully predicted the winners (or at least who would be among the winners) at this year’s festival using POPFile, the Bayesian filter of choice;
  • ShopZilla, a “leading shopping search engine,” uses POPFile “in collaboration with Kana to filter customer emails into different buckets so we can apply the appropriate quality of service and have the right people to answer to the emails. Fortunately, some of the buckets can receive satisfactory canned responses. The bottom line is that PopFile provides us with a way to send better customer responses while saving time and money.”
  • Indeed, even non-spam email can benefit from Bayes, filtering boring from non-boring email, say, or personal from work. Jon Udell experimented with this kind of thing a few years ago.
  • So can virus and malware detection. Here’s a post on the work by Martin Overton in keeping out the bad stuff simply using a Bayesian Filter. Here’s Martin’s actual paper (PDF only). (Martin has commented that he actually has two blogs addressing his work in this field, here and here.)
  • John Graham-Cumming, author of POPFile, says he’s been approached by people who would like to use it in regulatory fields, in computational biology and on dating websites (“training a filter for learning your preferences for your ideal wife,” as he puts it), and says he’s been considering feeding in articles from WSJ and The Economist in an attempt to find a way to predict weekly stock market prices. “If we do find it out,” he says, “we won’t tell you for a few years.” So he’s probably already doing it.

If you’re new to Bayes, I hope this doesn’t put you off. All you have to do is show it what to do and then leave it alone. If you haven’t tried POPFile and you’re having spam issues, give it a try. It’s free, easy to install and will probably be the smartest bit of software on your computer.

I suppose the way I see it is that Bayesian filters don’t care about how words look, what language they’re in, or what they mean, or even if they are words. They look at how the words behave. So while the Unspam guys found out that the word “riveting” was much more likely to be used by a reviewer to describe a dud movie than a good one, the Bayesian Filter isn’t going to care that that seems somewhat contradictory. In real life we would have been fooled, because we know “riveting” is a good thing (unless it’s some weird wedgie-style torture involving jeans that I haven’t come across). Bayes doesn’t know that. It just knows that the word has an unhealthy habit of cropping up in reviews of movies that bomb.

In a word, Bayesian Filters watch what words do, or what the email is using the words to do, rather than looking at the meaning of the words. We should be applying this to speeches of politicians, CEOs, PR types and see what comes out. Is there any way of measuring how successful a politician is going to be based on their early speeches? What about press releases? Any way of predicting the success of the products they tout?


Software: Spam Bully out of beta

 Spam Bully, an email spam filter that integrates into Outlook and Outlook Express, is now out of beta and officially ready to go.
 
 
I haven’t given Spam Bully a test run, but it uses Bayesian Filters, an approach I wrote about a few weeks back, so in theory it should work well.
 
From their press release: “Spam Bully’s self-learning email filter uses a probability based mathematical theory developed by 18th century British clergyman Thomas Bayes. Bayes’ theorem is based on the number of times an event has or has not occurred and the likelihood it will occur in the future. Using Bayes’ theories in conjunction with email filtration allows Spam Bully to determine the probability that an email is “spam” based on the words it contains. Spam Bully’s Bayesian filter was created from over 35,000 spam messages, allowing it to intelligently learn which words spammers are likely to use. Spam Bully will adapt itself to a user’s own email preferences and over time continually adjusts to new types of spam.”
 
Spam Bully costs $30.

Column: An end to spam?

Loose Wire — Exorcism for Spam: A theory devised by an English vicar and adopted by smart anti-spammers is your best bet for keeping spam out of your inbox

By Jeremy Wagstaff

from the 19 June 2003 edition of the Far Eastern Economic Review, (c) 2003, Dow Jones & Company, Inc.

A milestone, of sorts, was passed last month. According to MessageLabs, a United States-based company that studies these things, the Internet for the first time handled more spam e-mail messages than normal e-mails. In other words, for every legitimate e-mail sent, there was at least one spam, or unsolicited junk e-mail, sent. Compare that with a year ago when the ratio was about one spam for every 20 e-mails. A year before that? One in 1,500. Spam was never pretty, but it’s getting ugly, and something has to give. But what?

Spam is a business, and understanding that is halfway to embracing a solution that works. Why, for example, does MessageLabs spend so much time counting spam? Because it sells services and software that help companies avoid it. In fact, spam is, I suspect, much more profitable for the folk who clean it up than the guys who put it out. Think about it: It costs a spammer very little to send one e-mail, and only one in 10 million needs to generate a sale for the spammer to stay in business, but God knows how much it costs in lost man-hours for you or me to receive it, open it, read it, feel slightly nauseous, discard it and then wander over to the water cooler to complain to colleagues about it. There are conflicts of interest here that make me slightly uncomfortable advising you to buy products to keep out what shouldn’t be in your inbox anyway.

So here’s my solution: It’s simple, costs you nothing and will improve as you get more spam. Most anti-spam software looks for things it recognizes as spam-like, words like “Viagra,” for example, and filters out any message containing them. But this isn’t always that effective: replace “i” with “1” and you have v1agra, or add some invisible formatting code in the middle of the word, so the word looks the same to a reader, but different to a spam filter. So as spammers get more cunning, filters have to get smarter. This is why using logic, rather than keywords, makes sense. Enter an 18th-century vicar called Thomas Bayes from the English town of Tunbridge Wells. He devised a probability theory that has become a useful tool in gauging whether e-mail is spam or not.
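To see why keyword matching is so easy to dodge, here is a minimal Python sketch of that kind of filter. The blocklist and the function name are made up for illustration and are not taken from any real product.

```python
# Hypothetical blocklist; real keyword filters use much longer lists.
BLOCKED_WORDS = {"viagra"}

def keyword_filter(message: str) -> bool:
    """Return True if any blocklisted word appears in the message."""
    words = message.lower().split()
    # Strip trailing punctuation so "Viagra!" still matches "viagra".
    return any(word.strip(".,!?") in BLOCKED_WORDS for word in words)

caught = keyword_filter("Buy Viagra now!")    # exact keyword is caught
missed = keyword_filter("Buy v1agra now!")    # one swapped character slips through
```

The second message gets past the filter because the filter only knows exact spellings, which is the weakness described above.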

Briefly, Bayesian filters look at the content of e-mail (including the headers, in most cases, and the hidden code in e-mails, called HTML, that organizes fonts, colours and pictures), slice it into bits (words and chunks of code) and judge the probability of each bit being evidence of spam. The filter will then scrutinize the 15 most interesting bits, combine their probabilities (0.99, for example, meaning 99% likely it’s spam) and cast judgment on the e-mail. The more you prod it along (yes, this one is spam; no, this one looks like spam but is actually my Auntie Edith suggesting I have plastic surgery), the better it gets. And of course the more e-mail you get, the more it has to play with. Bayesian filters don’t just look for matches, they look for patterns of behaviour that give spam away.
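For the curious, the steps above can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not how POPFile or any other product actually works: the class name, the simple word tokenizer, the add-one smoothing and the way the probabilities are combined are all illustrative choices.

```python
import math
import re
from collections import Counter

class TinyBayesFilter:
    """Toy spam scorer: an illustrative sketch, not any real product's code."""

    def __init__(self, top_n=15):
        self.top_n = top_n            # how many "interesting" bits to judge
        self.spam_counts = Counter()  # messages marked spam that contain each token
        self.ham_counts = Counter()   # messages marked legitimate that contain each token
        self.spam_total = 0
        self.ham_total = 0

    @staticmethod
    def tokenize(text):
        # Assumption: tokens are lowercase runs of letters, digits and apostrophes.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def token_probability(self, token):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        p_in_spam = (self.spam_counts[token] + 1) / (self.spam_total + 2)
        p_in_ham = (self.ham_counts[token] + 1) / (self.ham_total + 2)
        return p_in_spam / (p_in_spam + p_in_ham)

    def spam_probability(self, text):
        probs = [self.token_probability(t) for t in self.tokenize(text)]
        # "Interesting" means furthest from the undecided value of 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        interesting = probs[: self.top_n]
        # Combine as prod(p) / (prod(p) + prod(1 - p)), computed with logs.
        log_spam = sum(math.log(p) for p in interesting)
        log_ham = sum(math.log(1 - p) for p in interesting)
        return 1 / (1 + math.exp(log_ham - log_spam))

spam_filter = TinyBayesFilter()
spam_filter.train("cheap pills buy now", is_spam=True)
spam_filter.train("buy cheap watches now", is_spam=True)
spam_filter.train("lunch meeting tomorrow at noon", is_spam=False)
spam_filter.train("auntie edith suggests plastic surgery", is_spam=False)

spammy_score = spam_filter.spam_probability("buy cheap pills")
friendly_score = spam_filter.spam_probability("meeting at noon")
```

With only four training messages the scores are rough, but the first message already scores well above 0.5 and the second well below it. Every extra message you mark as spam or not spam sharpens the counts, which is the "prodding along" described above.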

For starters, try POPFile, which will work on most operating systems and with most e-mail programs. If you’re squeamish about manual tweaking, check out Spammunition for Outlook or SpamBully for Outlook or Outlook Express ($30 from www.spambully.com).

On top of that, try a trick of my own: Ask colleagues or friends to assign agreed tags to subject lines and set up your e-mail program to recognize those tags and filter them into special folders. [Meet] for example, could be used to relate to meetings, [Budget] for stuff related to how much money you plan to waste that year and [Fire] for e-mails alerting staff they’re being downsized. Such e-mails would then leap past any filters and be easy to search for. Spam’s not going to go away soon, but with good filters you need never see it in your inbox again. Or go to the water cooler.
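Most e-mail programs let you set up this kind of rule through their own menus, but the logic is simple enough to sketch in Python. The folder names and the function name here are assumptions made up for the example.

```python
import re

# Hypothetical mapping from agreed subject tags to folder names.
TAG_FOLDERS = {"meet": "Meetings", "budget": "Budget", "fire": "Layoffs"}

# Match a tag in square brackets at the very start of the subject line.
TAG_PATTERN = re.compile(r"^\s*\[(\w+)\]")

def folder_for(subject, default="Inbox"):
    """Pick a folder from a leading [Tag]; fall back to the default folder."""
    match = TAG_PATTERN.match(subject)
    if match:
        return TAG_FOLDERS.get(match.group(1).lower(), default)
    return default

meeting_folder = folder_for("[Meet] Thursday planning")
untagged_folder = folder_for("Cheap pills, buy now")
```

In this sketch a tagged subject goes straight to its folder, and anything without an agreed tag falls through to the default, where the spam filter can deal with it as usual.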