John Graham-Cumming, author of Bayesian spam filter POPFile, points me to a neat tool he’s created which will turn an email address into an image that may spare you some spam from bots scouring web pages for email addresses:
This site converts a text-based email address (such as email@example.com) and creates an image that can be inserted on a web site. The image contains the email address and is easily read by a human, but is intended to fool web crawlers that search for email addresses.
I can’t guarantee that this is foolproof, but Project Honeypot reports that image obfuscation of an email address is very effective (they say 100%) against web crawlers.
Enter your email address in the box and the server returns a string of gobbledygook which contains the email address (padded with a large amount of random data to avoid a dictionary attack) encrypted using a key known only to the server. When the image is loaded into the web page the server decrypts the email address and creates the image. (The email address is not stored by the server; it resides only in the HTML on your website.)
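For the curious, here is a rough sketch of how that round trip might look. It is not John's code: the names (SERVER_KEY, PAD_LEN, make_token, read_token) are my own inventions, and the toy cipher built from HMAC-SHA256 merely stands in for whatever real cipher the server uses, so that the example needs nothing outside Python's standard library. Don't use it to protect anything that matters.

```python
import base64
import hashlib
import hmac
import secrets

# Hypothetical values: the real service's key handling and padding size are not documented here.
SERVER_KEY = secrets.token_bytes(32)  # known only to the server
PAD_LEN = 32                          # bytes of random padding placed before the address


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (HMAC-SHA256 in counter mode)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return out[:length]


def make_token(email: str, key: bytes = SERVER_KEY) -> str:
    """Pad the address with random bytes, encrypt it, and return web-safe gobbledygook."""
    nonce = secrets.token_bytes(16)
    plaintext = secrets.token_bytes(PAD_LEN) + email.encode("utf-8")
    stream = _keystream(key, nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return base64.urlsafe_b64encode(nonce + ciphertext).decode("ascii")


def read_token(token: str, key: bytes = SERVER_KEY) -> str:
    """Decrypt a token and strip the padding; the server would then draw the image."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    nonce, ciphertext = raw[:16], raw[16:]
    stream = _keystream(key, nonce, len(ciphertext))
    plaintext = bytes(c ^ s for c, s in zip(ciphertext, stream))
    return plaintext[PAD_LEN:].decode("utf-8")


token = make_token("email@example.com")
recovered = read_token(token)
```

The token is what would sit in your HTML; only a server holding the key can turn it back into an address, and the random padding means the same address never produces the same token twice.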
‘wagstaff (v): to poke any new technology with a long stick, make sure it does what it says on the box, and summarize the experience in less than 2,000 words’.
John concludes that the book “should be in the toilet. In fact, I think it’s such a good book for reading in small doses in a small, quiet room, that a global band of Gideons-like technology evangelists should be leaving copies in the smallest room in the house of any technophile.” Excellent idea. I’ll get onto my publisher about that.
In last week’s WSJ column (subscription only, I’m afraid) I wrote about how Bayesian Filters — derived from the theories of an 18th century vicar called Thomas Bayes and used to filter out spam — could also be used to sift through other kinds of data. Here’s a preliminary list of some of the uses I came across:
Deconstructing Sundance: how a bunch of guys at UnSpam Technologies successfully predicted the winners (or at least who would be among the winners) at this year’s festival using POPFile, the Bayesian filter of choice;
ShopZilla, a “leading shopping search engine”, uses POPFile “in collaboration with Kana to filter customer emails into different buckets so we can apply the appropriate quality of service and have the right people to answer to the emails. Fortunately, some of the buckets can receive satisfactory canned responses. The bottom line is that PopFile provides us with a way to send better customer responses while saving time and money.”
Indeed, even non-spam email can benefit from Bayes, filtering boring from non-boring email, say, or personal from work. Jon Udell experimented with this kind of thing a few years ago.
So can virus and malware detection. Here’s a post on the work by Martin Overton in keeping out the bad stuff simply using a Bayesian Filter. Here’s Martin’s actual paper (PDF only). (Martin has commented that he actually has two blogs addressing his work in this field, here and here.)
John Graham-Cumming, author of POPFile, says he’s been approached by people who would like to use it in regulatory fields, in computational biology and on dating websites (“training a filter for learning your preferences for your ideal wife,” as he puts it), and says he’s been considering feeding in articles from WSJ and The Economist in an attempt to find a way to predict weekly stock market prices. “If we do find it out,” he says, “we won’t tell you for a few years.” So he’s probably already doing it.
If you’re new to Bayes, I hope this doesn’t put you off. All you have to do is show it what to do and then leave it alone. If you haven’t tried POPFile and you’re having spam issues, give it a try. It’s free, easy to install and will probably be the smartest bit of software on your computer.
I suppose the way I see it is that Bayesian filters don’t care about how words look, what language they’re in, or what they mean, or even if they are words. They look at how the words behave. So while the Unspam guys found out that the word “riveting” was much more likely to be used by a reviewer to describe a dud movie than a good one, the Bayesian Filter isn’t going to care that that seems somewhat contradictory. In real life we would have been fooled, because we know “riveting” is a good thing (unless it’s some weird wedgie-style torture involving jeans that I haven’t come across). Bayes doesn’t know that. It just knows that the word has an unhealthy habit of cropping up in reviews of movies that bomb.
In a word, Bayesian Filters watch what words do, or what the email is using the words to do, rather than looking at the meaning of the words. We should be applying this to speeches of politicians, CEOs, PR types and see what comes out. Is there any way of measuring how successful a politician is going to be based on their early speeches? What about press releases? Any way of predicting the success of the products they tout?
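To make the “riveting” point concrete, here is a minimal sketch of a word-counting Bayesian classifier. It is not POPFile’s code, and the four review snippets are made up purely for illustration; the class name TinyBayes and the bucket names are my own. It simply counts which words turn up in which bucket and picks the bucket that makes a new piece of text most probable.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """A bare-bones naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> times seen
        self.doc_counts = Counter()              # bucket -> training texts seen

    def train(self, bucket, text):
        """Show the filter one example of what belongs in a bucket."""
        self.doc_counts[bucket] += 1
        self.word_counts[bucket].update(text.lower().split())

    def classify(self, text):
        """Return the bucket under which the text's words are most probable."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from how common the bucket is, then add evidence word by word.
            score = math.log(self.doc_counts[bucket] / total_docs)
            for word in text.lower().split():
                # Add-one smoothing so an unseen word does not zero out a bucket.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[bucket] = score
        return max(scores, key=scores.get)


# Made-up training snippets: in this toy data "riveting" only shows up for flops.
reviews = TinyBayes()
reviews.train("flop", "a riveting thriller that never lets go")
reviews.train("flop", "riveting performances in a muddled plot")
reviews.train("hit", "a quiet funny and moving debut")
reviews.train("hit", "moving and funny with a sharp script")

print(reviews.classify("a riveting debut"))  # flop
```

The filter has no idea that “riveting” is meant as praise. It has only seen the word keeping bad company, and that is enough to tip the verdict.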
McAfee seems to have come somewhat late to the spam party: Network Associates, Inc., ‘the leader in intrusion prevention solutions’, today announced that it has incorporated “powerful new Bayesian filtering into the latest McAfee SpamAssassin engine”. What, only now?
Bayesian filtering is a pretty powerful weapon in the war against spam. I use POPFile and K9 and would recommend either, not least because they’re free. But why has it taken so long for McAfee to get around to including it in their SpamAssassin product?
To be fair, the McAfee Bayesian filter is “fully automated in its learning abilities, whereas other competitive solutions require manual training by users or systems administrators”. That is an improvement, but I wonder how well it works.
SpamKiller/Assassin also includes some other features, including Integrity Analysis, which applies algorithms to determine if the email is spam, Heuristic Detection, Content Filtering, Black and White Lists and DNS-Blocklist Support.
Melbourne-based Aliencamel, an anti-spam service I tried a few months back and thought was good but not perfect, have just announced some new features which may make the product more competitive in a tight marketplace. Aliencamel works as a mix of different anti-spam and anti-virus elements designed to keep out the riff-raff so you only download what you want.
The new version turns Aliencamel into a kind of email account in its own right, including the ability to preview email in a web browser before tagging it as spam or downloading via your normal email program, full webmail access to your mailbox, as well as disposable email addresses you can use to deal with suspect web sites and third parties you’re not sure about. On top of that the service’s Pending Email Advisory (a sort of floating alert that lets you know of new email that is suspect without actually sending it to you) has been changed to reduce the frequency of advisory emails.
Most important, I think, is the fact that Aliencamel are going to embrace Bayesian filters — the simple method of assigning a probability of spamminess to emails by looking at the innards of the email (content, header, HTML code) and comparing it to other emails it has looked at. I adore Bayesian filters (I still use POPFile) so I think it’s great that Aliencamel are moving in that direction.
(Aliencamel, by the way, is an anagram of clean email. It took me months to get it.)
Spamotomy, an independent reviewer of anti-spam tools that I hadn’t heard of, has awarded its highest rating ever for a desktop anti-spam product to InBoxer from Audiotrieve, which I also haven’t heard of. And I thought I was on top of the whole spam thing.
The Spamotomy review, apparently, is the result of an extensive week-long evaluation involving the processing of thousands of email messages, including more than a thousand junk mail messages. InBoxer was effective right from the start, according to the Spamotomy review. At the end of a week, InBoxer removed 96.5% of all spam with a 0.07% false positive rating. That’s not bad, though it’s not as good as POPFile has achieved over a longer period.
A possible downside: InBoxer only works with Microsoft Outlook. The product has a list price of $24.95.
This is not new, but worth passing on to those folk that would like to understand spam a bit better. Spam is a pain for all of us, and it’s not likely to get better. But the more we understand it, the more we can do something about it. If you think it’s just a bunch of sleazy guys who don’t know about computers and don’t know how much damage they cause, read this. It’s a PDF Acrobat file version of the presentation by one John Graham-Cumming, who designed the free POPFile spam filter I use and rave about every chance I get.
John goes into fascinating detail about the tricks spammers use, which helps us realise a couple of things:
1) These spammers are smart, or have smart people working for them;
2) Spamming is not going to go away, and spam filters are going to have to get smarter to keep up;
3) It may be worth splashing out on spam filter software if you’re a big company, but if you’re an individual, you may well be better off using POPFile and doing what you can to support folk like John, who are as close to the cutting edge of anti-spam design as anyone. (If you really like his work, buy some of his stuff.)
I don’t feel like I’ve passed on anything about spam for at least half an hour so here goes. ActiveState, “the leader in enterprise email management software”, has released an ActiveState Field Guide to Spam, which details advanced tricks used by spammers to hide their messages from spam filters.
Regular readers of this blog (or folk who spend their weekends inspecting spam) will be familiar with most of these tricks, but it’s an education nonetheless. However, I am beginning to think that however clever spammers are, there’s a point beyond which it’s just not worth the effort for them. That’s when we all get Bayesian filters running and tune them. The only spam I worry about these days is press releases like this one from ActiveState. I swear it’s taken me longer to find the right link to their website than it would take to clean out the one or two bits of spam that get past my spamblocker (POPFile, in case you haven’t been paying attention). Or am I missing something?
A few weeks back I reported on the revival of Calypso, an excellent email program, by the folks at Rose City Software. Their rechristened Courier does everything Calypso did, but it’s got one or two features that may help tip the balance for those of you not sure it’s worth the hassle of switching. My favourite feature is its integration with POPFile, which, coincidentally, is my spam filter of choice (and now 99.18% accurate, I’m glad to report). Anyway, this is the neat bit: Courier allows you to reclassify email that POPFile may have got wrong (marking it as spam, for example, instead of legit email) just by right-clicking on the email in question. Superb.
One gripe for Rose City: Can we have better icons? I can’t see the yellow envelope in the system tray, especially after a couple of beers.