Schrödinger's Collaborative Spam Filter

DeWitt Clinton
June 2005

Schrödinger's Spam

I'd like to throw out an idea that I'll call Schrödinger's Collaborative Spam Filter.

As everyone knows, email spam is a huge problem. It costs innocent people money and wastes their time. It preys on the naïve and defenseless. It is often fraudulent and illegal. Now I'm non-violent by desire, but if I happened to be face to face with someone who sends email spam (or blog comment, search engine, or referrer spam) then I would honestly have to restrain myself from getting physical with them. For the life of me I can't understand the mentality of someone who would willingly choose to try to profit from such a morally bankrupt practice.

Some combination of Bayesian filtering, sender validation, and blacklisting has been reasonably effective at combating the worst email spam. If you run your own mail server then you can make some headway against spam. However, if you use your ISP's mail server or a webmail application such as Hotmail, Yahoo! Mail, or Gmail, then you are essentially at the mercy of how well they keep the spam out of your inbox.

As far as I can tell, all of the major webmail providers employ some form of collaborative filtering that leverages Other People's Work (i.e., their users) to combat spam. In other words, if 100 users mark similar-looking email as spam then the provider can block it from being delivered to the next 100,000. Of course, it's not that simple -- spammers have long been varying the individual contents of each spam email to make it difficult to wholesale reject a particular message. But from the user's perspective, most webmail providers are getting pretty darn good at blocking the bulk of spam email. Anecdotally, I use Gmail, and it blocks about 70 spam emails addressed to me each day and lets about 10 through to my inbox.
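To make the "100 users" idea concrete, here is a rough sketch of report counting. I have no idea how any real provider does this, so everything in it is an assumption for illustration: the fingerprint function (a crude stand-in for "similar looking"), the CollaborativeFilter class, and the threshold value are all made-up names, not anyone's actual API.

```python
import re
from collections import defaultdict

# Hypothetical threshold, mirroring the "100 users" figure in the text.
REPORT_THRESHOLD = 100


def fingerprint(body: str) -> str:
    """Crude stand-in for 'similar looking': lowercase the text, keep only
    letters and whitespace, then collapse runs of whitespace."""
    letters_only = re.sub(r"[^a-z\s]", "", body.lower())
    return " ".join(letters_only.split())


class CollaborativeFilter:
    """Counts distinct users who reported each fingerprint as spam."""

    def __init__(self, threshold: int = REPORT_THRESHOLD):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # fingerprint -> set of user ids

    def report_spam(self, user_id: str, body: str) -> None:
        # A set means one user reporting twice still counts once.
        self.reporters[fingerprint(body)].add(user_id)

    def is_known_spam(self, body: str) -> bool:
        # .get avoids creating empty entries for messages nobody reported.
        reporters = self.reporters.get(fingerprint(body), ())
        return len(reporters) >= self.threshold
```

A real system would need something far more robust than stripping digits and punctuation, precisely because spammers vary each message, but the counting logic is the part that matters here.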

It's those last ten spam emails that slip through that I'm starting to wonder about. Most of them are quite obviously spam. Or at least, they look exactly like all the other spam emails that were caught by the default filters that Gmail employs. So as a user I'm frustrated that they can catch all the other ones but let those slip through. But then you stop and think -- "well, if they are using collaborative techniques, then perhaps I was the first person to get that particular piece of spam. Maybe no one had been able to flag the patterns yet, so it got delivered to me. If I mark it as spam, then it is less likely to land in someone else's mailbox next time."

Maybe that's what you or I would think. But that's not what Erwin Schrödinger would think. No, that crafty Austrian would think "before I've observed it, it is neither spam nor not spam, it is both spam and not spam." The act of observation would cause the email to collapse into a spam or non-spam state. (Please don't anyone think I'm serious about the physics of this, it's a metaphor.) Going further -- until I've checked my inbox, there is no message, spam or otherwise. (Actually, to be really accurate, that's not what he'd think -- he'd think that it was either/or the whole time.)

So the conclusion is -- why stop filtering spam once it has been delivered to your inbox? If a message has been delivered, but you haven't seen it yet, and 100 people say that it is spam, why not remove it before you see it? The system knows whether or not you've read your email, so it knows what it can do without you even knowing that it has been done.
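Here is a minimal sketch of what I mean, assuming the mail system tracks a "seen" flag per message and has some set of fingerprints that the collaborative filter has since flagged. The Message and Inbox classes and the retroactive_purge method are hypothetical names I made up. I also chose to move purged mail to a spam folder rather than delete it outright, which is my own assumption and not part of the idea as stated.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: str
    fingerprint: str    # assumed to come from the collaborative filter
    seen: bool = False  # has the user observed this message yet?


@dataclass
class Inbox:
    messages: list = field(default_factory=list)
    spam_folder: list = field(default_factory=list)

    def deliver(self, msg: Message) -> None:
        self.messages.append(msg)

    def open_inbox(self) -> list:
        """The user looks at the inbox: every message present is now observed."""
        for msg in self.messages:
            msg.seen = True
        return list(self.messages)

    def retroactive_purge(self, known_spam: set) -> int:
        """Move unseen messages that have since been flagged as spam out of
        the inbox. Messages the user has already seen are left alone."""
        kept, moved = [], 0
        for msg in self.messages:
            if not msg.seen and msg.fingerprint in known_spam:
                self.spam_folder.append(msg)
                moved += 1
            else:
                kept.append(msg)
        self.messages = kept
        return moved
```

The provider would run retroactive_purge whenever the set of known spam grows, so a message delivered in the morning and flagged by others at noon is gone before the user checks mail in the evening.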

Empirical evidence suggests that webmail applications are not Schrödinger filtering yet. I have made a habit of only checking my email a handful of times a day, and when I check it at the end of the day I often find spam that has been sitting around for hours. And since some of that spam is of the obvious type, I can only assume it was delivered to my inbox before the collaborative filtering kicked in. The question is -- why was it still there when I checked it 6 hours later? At that point it is known spam, and the user shouldn't have to see it.

Now the "Copenhagen interpretation" of this idea implies that if you can observe the system prior to checking your inbox, then the wavefunction (spam or not-spam) must collapse early and thus defeat the Schrödinger filter. This is important. It means that if you forward your email, use a mail notifier (a "biff"), or use the nifty RSS feeds, etc., then you will force an early determination of the spam-ness of each message. If each message needs to collapse into spam or non-spam soon after it is delivered, then the system is unable to perform a retroactive purging of spam.
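In code terms, "observation" is just any channel that shows the message to the user before the webmail inbox does. A rough sketch, with the Channel enum, the Message class, and the deliver function all being names I invented for illustration:

```python
from dataclasses import dataclass, field
from enum import Enum


class Channel(Enum):
    WEB_INBOX = "web inbox"
    FORWARD = "forward"
    NOTIFIER = "notifier"
    RSS = "rss feed"


@dataclass
class Message:
    msg_id: str
    observed_via: set = field(default_factory=set)

    def observe(self, channel: Channel) -> None:
        self.observed_via.add(channel)

    @property
    def purgeable(self) -> bool:
        # Only a message nobody has looked at, by any channel, can still
        # be quietly removed.
        return not self.observed_via


def deliver(msg: Message, push_channels: set) -> Message:
    """Deliver to an account. Any push channel the user has enabled
    observes the message immediately, collapsing it on arrival."""
    for channel in push_channels:
        msg.observe(channel)
    return msg
```

So an account with forwarding or a notifier turned on never has a window in which a message is purgeable, while an account read only through the web inbox keeps that window open until the next visit.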

Then again, why not offer the option? Why not ask your users to give you permission to delete email from the inbox if you are really certain it is spam? I for one wouldn't object to that at all. I'd rather have my inbox clean itself up than force me to do the work.

Granted, this idea has undoubtedly been proposed and implemented a hundred times before. I just figured I'd toss it out there in case anyone hadn't thought of it themselves. (Jeremy, feel free to pass this one along... Maybe I'll switch over to Yahoo Mail.)

And speaking of switching, has anyone built a "home version" of a webmail client that kicks as much ass as Gmail? I've been running my own mail server for the past ten years, loyally using mutt as an email client, until the day I tried Gmail for the first time and never looked back. As I wrote before, Gmail's user experience not only blows away the webmail competition, it blows away the desktop competition.

That said, I'd drop Gmail in a heartbeat if I could find an alternative that I could host myself. But as far as I know, none exist. So in the spirit of the lazyweb (not sure if I like that term), I'd be happy to kick in a bounty of a few hundred dollars to anyone who writes a webmail application that I like enough to use for myself on unto.net. I could write the back-end myself (though not to scale to millions of users, of course), but the front-end is far beyond my meager client-side means. If you know of anything, let me know...