The corpus is primarily intended for academic research and development of anti-spam filters and has significant restrictions on its use. This collection is important as it provides a standardized collection to test and compare spam filters in both academic and commercial contexts.
They are wrong. Using any corpus older than a month and obscuring the mail headers is actually detrimental to testing and comparing spam filters. Why? Because spam is a real-time phenomenon, and using “stale-mail” to test against it is a waste of time. Your results will smell worse than 30-day-old Wonder Bread.
To be clear, the bulk of spam is no longer sent by that shady character using a spam cannon in his garage to blast out 200 million messages a day. Spam is sent by a worldwide network of zombies (compromised machines), which has made it much harder to track and stop the onslaught.
A key technical innovation in defending against these zombies was the reputation system. IronPort’s SenderBase [3] and CipherTrust’s TrustedSource [4] are the two highest-profile reputation systems out there. Basically, by tracking the types of messages coming from a specific IP (and using some fancy mathematics), you can get a pretty good feel for whether it is a legitimate sender or not.
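To make the reputation idea concrete, here is a minimal Python sketch. It is emphatically not how SenderBase or TrustedSource work (their math is proprietary and far fancier). The class name ReputationTracker, the smoothing prior, and the 0-to-1 scale are all my own assumptions for illustration. It simply tracks, per sending IP, what fraction of observed messages looked legitimate.

```python
from collections import defaultdict


class ReputationTracker:
    """Toy per-IP reputation (hypothetical, for illustration only).

    The score is the smoothed fraction of messages from an IP that were
    judged legitimate: 1.0 is a spotless sender, 0.0 is pure spam.
    """

    def __init__(self, prior_ham=1.0, prior_spam=1.0):
        # Assumed smoothing prior: an IP we have never seen scores a neutral 0.5.
        self.prior_ham = prior_ham
        self.prior_spam = prior_spam
        self.counts = defaultdict(lambda: {"ham": 0, "spam": 0})

    def record(self, ip, is_spam):
        """Record one observed message from this IP."""
        self.counts[ip]["spam" if is_spam else "ham"] += 1

    def score(self, ip):
        """Return the reputation in [0, 1] for this IP."""
        seen = self.counts.get(ip, {"ham": 0, "spam": 0})
        ham = seen["ham"] + self.prior_ham
        spam = seen["spam"] + self.prior_spam
        return ham / (ham + spam)


tracker = ReputationTracker()
for _ in range(8):
    tracker.record("203.0.113.45", is_spam=True)   # a zombie spewing spam
for _ in range(3):
    tracker.record("198.51.100.7", is_spam=False)  # a well-behaved sender

zombie_score = tracker.score("203.0.113.45")
good_score = tracker.score("198.51.100.7")
unknown_score = tracker.score("192.0.2.1")
```

Notice that the score only means something if it reflects what the IP has been sending lately, and if the filter can see the real sending IP at delivery time.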
Combining reputation with heuristics and signatures creates a cocktail of techniques that can be used to detect spam more accurately. Now anyone who says they can consistently stop 99% of spam is lying to you. Spamming techniques change fast enough that effectiveness will ebb and flow as the spammers and anti-spammers engage in constant point-counterpoint. But in general, most of the solutions out there do a good enough job.
Now back to TREC 2005. I am a big fan of bake-offs (technical evaluations) during the procurement process (see Buying Security products post [4]). Having users compare spam catch rates using stale-mail is a disservice because real-time reputation checks cannot happen on stale mail. Who the message is coming from is a critical part of today’s detection techniques. So, using a pre-baked corpus eliminates that set of tests and will make your results suspect at best.
It is also a very bad idea to just forward the test corpus through a bit blaster. This puts your email security gateway as the second hop in, so the connection it sees comes from your bit blaster rather than from the original sender, and the true sender’s mail header is obscured. This dramatically impacts your ability to accurately detect the spam. I can get into more technical nuances off-line, but take my word for it. Your results will be crap. In fact, a number of well-known publications used this technique in early anti-spam reviews and their results weren’t worth the paper they were printed on. It took 18 months (and a lot of my personal blood and sweat) to get them to see the fault in this testing methodology.
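For the curious, here is a rough Python sketch of the second-hop problem. The message, hostnames, and IP addresses are made up (documentation-range addresses), and the helper received_ips is a hypothetical name of mine. It uses the standard library email parser to list the IPs recorded in the Received headers, most recent hop first.

```python
import re
from email import message_from_string

# Matches a bracketed IPv4 address such as [203.0.113.45] in a Received header.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(raw_message):
    """Return the IPs found in Received headers, most recent hop first."""
    msg = message_from_string(raw_message)
    ips = []
    for header in msg.get_all("Received", []):
        match = IP_IN_BRACKETS.search(header)
        if match:
            ips.append(match.group(1))
    return ips


# Fabricated example: a corpus message replayed through a test relay.
# Each hop prepends its own Received header, so the top one is the newest.
RAW = (
    "Received: from replay.test.example ([198.51.100.7])"
    " by gateway.example; Mon, 1 Jan 2007 00:00:02 +0000\n"
    "Received: from zombie.example ([203.0.113.45])"
    " by mx.victim.example; Mon, 1 Jan 2007 00:00:01 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

hops = received_ips(RAW)
connecting_ip = hops[0]   # the peer the gateway would look up: the replay box
original_sender = hops[-1]  # the zombie, buried one header down
```

A gateway doing a reputation lookup on the connecting peer would be grading the replay box, not the zombie that originally sent the message.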
So how do you test anti-spam products? Basically you need to use them in real mail flow. I believe you should set up a group of test users (who are a bit more understanding than your CEO) and run their ACTUAL mail through the box for a month. Then you can gauge real-time effectiveness and select the best fit for your organization.
UPDATE: Let me clarify that a corpus like this will be useful to an anti-spam researcher, who presumably understands how to tune their heuristics and/or signatures. My point is that this kind of corpus will NOT be useful to end users trying to compare anti-spam products.
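If you want to put numbers on that month-long trial, something like the following Python sketch would do. The function name trial_metrics and the idea of having test users label each message are my assumptions, as is tracking false positives (legitimate mail that got blocked) alongside the catch rate. The sample data is invented.

```python
def trial_metrics(results):
    """Summarize a live-mail trial.

    results: iterable of (is_spam, was_blocked) pairs, one per message,
    where is_spam is the test user's verdict (an assumed labeling step).
    Returns (catch_rate, false_positive_rate).
    """
    spam_total = spam_caught = ham_total = ham_blocked = 0
    for is_spam, was_blocked in results:
        if is_spam:
            spam_total += 1
            spam_caught += was_blocked
        else:
            ham_total += 1
            ham_blocked += was_blocked
    catch_rate = spam_caught / spam_total if spam_total else 0.0
    false_positive_rate = ham_blocked / ham_total if ham_total else 0.0
    return catch_rate, false_positive_rate


# Invented sample: 10 spam messages (9 blocked), 20 legitimate (1 blocked).
sample = (
    [(True, True)] * 9
    + [(True, False)]
    + [(False, False)] * 19
    + [(False, True)]
)
catch_rate, false_positive_rate = trial_metrics(sample)
```

Run the same tally for each product over the same test users and the same month, and you have a comparison based on live mail rather than stale-mail.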