Blocking Spam: An Idea

Jun 25, 2007

Consider a firm filtering out spam for a million customers. One way of identifying spam is to look for messages received by multiple customers. If ten thousand people receive identical messages, it is a pretty safe bet that they are all spam.

One problem with doing this is privacy; the customers do not want someone else to be reading their mail. The comparison would, of course, be done by computer, but once the message has been sent to the spam filtering company, the customer has no way of knowing who, other than a computer, is looking at it.

There is a simple solution to the problem. Instead of forwarding your email to the filtering company, forward a hash of your email. Your own computer applies a one-way hash function to each message, calculating from it a long number. If the number is long enough, the probability that two different messages will hash to the same number becomes vanishingly small. But a twenty-digit number still contains much less information than a hundred-word email, so there is no way of reversing the process and deducing the message from its hash. Forward the hash to the spam filtering company. Doing that not only protects your privacy, it also takes a lot less bandwidth than forwarding the email. Get back information on whether or not it matches the hash of messages received by many other customers, and junk or read the email accordingly.
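To make the scheme concrete, here is a minimal sketch in Python. It rests on assumptions that are not in the post: SHA-256 is one possible choice of hash function, the names message_hash and HashCounter are hypothetical, and the threshold of three customers is set low purely for illustration.

```python
import hashlib
from collections import defaultdict


def message_hash(body: str) -> str:
    """Client side: reduce a message to a fixed-length SHA-256 digest."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class HashCounter:
    """Hypothetical filtering service: it only ever sees digests, never mail."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # Maps each digest to the set of customers who reported it.
        self.seen_by = defaultdict(set)

    def report(self, customer_id: str, digest: str) -> bool:
        """Record a digest; return True once enough distinct customers sent it."""
        self.seen_by[digest].add(customer_id)
        return len(self.seen_by[digest]) >= self.threshold


# Illustrative run: three customers receive the same message.
service = HashCounter(threshold=3)
spam = "Buy cheap watches now!"
verdicts = [
    service.report(f"customer-{i}", message_hash(spam)) for i in range(3)
]
# A message seen by only one customer is not flagged.
personal = service.report("customer-0", message_hash("Lunch on Friday?"))
```

In this sketch only the third report of the shared message comes back flagged, and the personal message does not. Counting distinct customers rather than raw reports keeps one customer's repeated copies from tripping the threshold. An exact hash like this matches only byte-for-byte identical messages, which is the case described above.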

Have I just reinvented the wheel? Is anyone currently using some variant of this approach?

David Friedman’s Substack
