Testing Spam Products - Use Corpuses at Your Own Risk

Submitted by Mike Rothman on Thu, 2006-03-09 08:06.
Saw a post yesterday on the Ferris Research blog [read it here] about a publicly available spam corpus called TREC 2005 (for more information about TREC read here). Ferris’ position on the value of TREC 2005 is:
The corpus is primarily intended for academic research and development of anti-spam filters and has significant restrictions on its use. This collection is important as it provides a standardized collection to test and compare spam filters in both academic and commercial contexts.


They are wrong. Using any corpus older than a month and obscuring the mail headers is actually detrimental to testing and comparing spam filters. Why? Because spam is a real time phenomenon and using “stale-mail” to test it is a waste of time. Your results will smell worse than 30 day old Wonderbread.

To be clear, the bulk of spam is no longer sent by that shady character using a spam cannon in his garage to blast out 200 million messages a day. Spam is sent by a worldwide network of zombies that have made it much harder to track and stop the onslaught.

A key technical innovation in defending against these zombies was the reputation system. IronPort’s Senderbase and CipherTrust’s TrustedSource are the two highest profile reputation systems out there. Basically, by tracking the types of messages coming from a specific IP (and using some fancy mathematics), you can get a pretty good feel for whether they are a legitimate sender or not.
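As a rough illustration of the idea, the sketch below keeps per-IP counts of spam and legitimate messages and turns them into a smoothed score. The class name, the smoothing priors, and the scoring formula are all assumptions made for illustration; nothing here describes how SenderBase or TrustedSource actually compute reputation.

```python
from collections import defaultdict


class ReputationTracker:
    """Toy per-IP reputation score (hypothetical, not any vendor's algorithm)."""

    def __init__(self, prior_spam=1.0, prior_ham=1.0):
        # Assumed smoothing priors: an IP with no history scores a neutral 0.5.
        self.prior_spam = prior_spam
        self.prior_ham = prior_ham
        self.counts = defaultdict(lambda: {"spam": 0, "ham": 0})

    def record(self, ip, is_spam):
        """Record one observed message from `ip`."""
        self.counts[ip]["spam" if is_spam else "ham"] += 1

    def spam_score(self, ip):
        """Return the smoothed fraction of spam seen from `ip` (0.0 to 1.0)."""
        seen = self.counts.get(ip, {"spam": 0, "ham": 0})
        spam = seen["spam"] + self.prior_spam
        ham = seen["ham"] + self.prior_ham
        return spam / (spam + ham)
```

A gateway could combine a score like this with its heuristics and signatures. The key point is that the score only means something for the IP that is actually connecting at delivery time.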

Combining reputation with heuristics and signatures creates a cocktail of techniques that can be used to more accurately detect spam. Now anyone who says they can consistently stop 99% of spam is lying to you. Spamming techniques change fast enough that effectiveness will ebb and flow as the spammers and anti-spammers engage in constant point-counterpoint. But in general, most of the solutions out there do a good enough job.

Now back to TREC 2005. I am a big fan of bake-offs (technical evaluations) during the procurement process (see Buying Security products post). Having users compare spam catch rates using stale-mail is a disservice because real time reputation checks cannot happen on stale mail. Who the message is coming from is a critical part of today’s detection techniques. So, using a pre-baked corpus eliminates that set of tests and will make your results suspect at best.

It is also a very bad idea to just forward the test corpus through a bit blaster. This puts your email security gateway as the second hop in and obscures the true sender’s mail header. This dramatically impacts your ability to accurately detect the spam. I can get into more technical nuances off-line, but take my word for it. Your results will be crap. In fact, a number of well-known publications used this technique in early anti-spam reviews and their results weren’t worth the paper they were printed on. But it took them 18 months (and a lot of my personal blood and sweat) to get them to see the fault in this testing methodology.
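To make the second-hop problem concrete, here is a minimal sketch, assuming a gateway that looks up reputation for whichever host connects to it. The reputation table, the function names, and the addresses (taken from documentation IP ranges) are hypothetical.

```python
# Hypothetical reputation table: spam likelihood per connecting IP.
REPUTATION = {"203.0.113.45": 0.95}  # a known zombie
NEUTRAL = 0.5                        # assumed score for hosts with no history


def connecting_ip(hops):
    """Return the host that hands the message to the gateway.

    `hops` lists the sending hosts in delivery order, original sender first.
    """
    return hops[-1]


def reputation_seen_by_gateway(hops):
    """Look up reputation for the connecting host only, as the gateway would."""
    return REPUTATION.get(connecting_ip(hops), NEUTRAL)


# Live mail flow: the zombie connects straight to the gateway.
live = ["203.0.113.45"]
# Replayed corpus: a test harness relays the same message.
replayed = ["203.0.113.45", "192.0.2.10"]
```

In the live case the gateway scores the zombie itself. In the replayed case it scores the test harness, which has no history, so the reputation signal is lost.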

So how do you test anti-spam products? Basically you need to use them in real mail flow. I believe that you set up a set of test users (that are a bit more understanding than your CEO) and run their ACTUAL mail through the box for a month. Then you can gauge real time effectiveness and select the best fit for your organization. 

UPDATE: Let me clarify a bit that a corpus like this will be useful to an anti-spam researcher, who presumably understands how to tune their heuristics and/or signatures. My point is that this kind of corpus will NOT be useful to end users trying to compare anti-spam products.

 

Submitted by Gordon Cormack (not verified) on Fri, 2006-03-10 10:13.
Mike uses a number of strawman arguments and half-truths to attack the TREC evaluation methodology and the TREC corpus. Any test, including TREC, will have certain limitations. Drug testing, for example, is difficult to do as you can't simply start feeding random chemicals to people. Even if you could do so ethically, you wouldn't be able to measure the effect due to lack of scientific controls. So drug testers do various kinds of experiments: in vitro experiments, in vivo in animals, and in vivo in humans. Even the in vivo experiments in humans typically use a subject population that is not identical to that of general humankind.

Mike has stated the hypothesis that real-time techniques etc. are very important to spam filtering. However, he does not suggest any valid mechanism to test that claim. "Just try it" is equivalent to feeding random chemicals to humans and calling it a drug trial. So his assertion that these real-time effects dominate content-based learning effects is at the moment mere conjecture.

Mike suggests some limitations of the TREC method - some of his assertions have some validity and some are outright false. I'll get to that in a moment. But first a comment - pointing out limitations in one sort of test does not lead to the conclusion that one should throw out the test and go back to subjective impressions to decide what's good and what is bad. The appropriate thing to do is to construct different - perhaps better, perhaps different - tests and to *compare* the results. So, for example, if Mike were to do some of his preferred testing, and also test the same things using TREC methods, one of two things would happen: the tests would agree or the tests would disagree. If the former, then our confidence is increased; if the latter, we have to construct more tests to figure out why.

Mike asserts that we obscured the mail headers. We did not. Mike asserts that the spam in the corpus is old. It is not.
Mike asserts that reputation based systems are slam-dunk winners over content-based systems. There was an experiment at TREC (IBM) that showed some improvement, but the heavy lifting was done by the content-based part. I don't purport to say this result is definitive, but if Mike thinks it is so obvious he should construct some sort of experiment to demonstrate it. "Trust me" just doesn't cut it, I'm afraid. While Mike's constructing his experiment I suggest he pay careful attention to the adjudication of messages - how exactly is he going to know which messages are spam and which are not? Users take time to report errors, and also notoriously underreport errors. Even a small amount of underreporting of errors can result in a huge overestimate of the effectiveness of spam filters. The bottom line is that I would be happy to sit down and discuss the validity of both in vitro and in vivo spam tests. However, throwing out the results of one in favour of a half-baked version of the other hardly increases our knowledge.
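The underreporting point can be worked through with simple arithmetic. The sketch below assumes that the only spam an evaluator knows about is what the filter caught plus what users reported; the function name and the numbers in the usage note are illustrative, not measured results.

```python
def apparent_catch_rate(caught, missed, report_fraction):
    """Catch rate an evaluator would compute from user reports.

    caught: spam messages the filter blocked
    missed: spam messages that reached users
    report_fraction: share of missed spam that users actually report (0 to 1)
    """
    reported_missed = missed * report_fraction
    return caught / (caught + reported_missed)


def true_catch_rate(caught, missed):
    """Catch rate when every missed message is counted."""
    return caught / (caught + missed)
```

For example, with 980 messages caught and 20 missed, the true miss rate is 2%. If users report only half of the misses, the evaluator sees 10 misses out of 990 known spam messages, so the apparent miss rate is about 1%, half the real figure.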
Submitted by Mike Rothman on Fri, 2006-03-10 10:51.

Gordon, I appreciate your comments and the passion with which you state your argument. But let me make something abundantly clear: I am not in academia and this is not an academic argument where there are absolute rights and wrongs that can be hypothesized, tested and proven. That's not my job, and if you think that because I don't test products that I am not entitled to have an opinion, that's fine. But that's an issue you have with the broader research business, and we'll need to agree to disagree on that.

My focus is on making sure that end users do not waste their time with useless tests that will not help them make better decisions on which security products to use. I do not contend that YOU obscured the headers, but in my experience many of the testing harnesses used to evaluate these products take a corpus (maybe yours, maybe someone else's) and just forward the messages along. THUS OBSCURING THE HEADERS. So I don't have any issue with the validity of your data set, but I do have an issue with how a great majority of the unsophisticated users will ultimately use the data set to potentially draw the wrong conclusions.

That is what I'm pointing out. Users should use a corpus at their OWN RISK. Similar to your drug testing analogy. I'm not saying NOT to use experimental drugs, but users need to understand that using this approach exclusively will leave a huge gap in the test, namely the ability to detect real time attacks.

Since your dataset is called TREC 2005, how can you tell me it's not OLD? It is now March 2006, which, even if all messages were gathered on Dec 31, 2005, still makes the data set over two months old. The reality of the market (whether you want to believe it or not) is that the bulk of the spam is sent by zombies now, and an aged corpus is not going to help you understand how a product is going to stop this real time spam. If your data set is refreshed throughout the year, then calling it TREC 2005 is confusing and a misnomer.

To further clarify, I don't believe that real time testing is the only solution. But I learned from the school of hard knocks in working with some very big, very real customers (not in a lab situation) that without real time testing, the effectiveness of the solution is not up to par. So, I'm saying you need all of the above: real time, heuristics as well as signatures. A corpus can help SPAM RESEARCHERS tune a set of heuristics and in that case, I'm sure your dataset is fine.

I'll again make the point that end users should not be using ANY corpus to test or tune their products. In my opinion, the anti-spam products are mature enough and the problems facing your typical organization are vast enough that someone spending a day, a week or a month tuning an anti-spam product needs to be fired. There are cheap commercial products that can be set up and forgotten. Sure, effectiveness will ebb and flow, but for the most part these products are now hands-off. The vendors have researchers that spend all day trying to make their stuff better. There is no way that a typical end user can spend the time necessary to tune a product enough to make it work as well. It just isn't going to happen.

To reiterate, I recommend that end users put the products into mail flow and test them under the real life environment they live in. Your corpus cannot be representative of a specific company's traffic because spam is subjective. One man's information is another man's spam. You may disagree because you have enough messages to be statistically reliable, but again, I can only rely on my experience in real world situations. Customers need to test a product in a real life situation. A lab eval is not good enough for this type of product.

Again, I appreciate your comments and passion, but we just disagree.

 

Submitted by Gordon Cormack (not verified) on Fri, 2006-03-10 14:16.
I appreciate that you have moderated somewhat the tone set by your "they are wrong!" opening line. I do not agree that we must agree to disagree. Your claims are testable and tested they should be.

Your bottom line: "I recommend that end users put the products into mail flow and test them under the real life environment they live in. [...] A lab eval is not good enough for this type of product."

My bottom line: An uncontrolled month-long in-house evaluation is unlikely to produce valid results. Appeal to the expertise/claims of the vendor is equally invalid. So is relying on your intuition about the mechanisms of spam senders and spam filters. Independently conducted tests, under controlled conditions, are more likely to predict actual results than any of these.

You have made several claims. They're testable, so prove them. Claim 1 is that up-to-the-minute spam is more readily filtered by the techniques you tout. Claim 2 is that one organization's users may have a different enough definition of spam so as to invert the relative effectiveness of two spam filters. Rather than merely asserting these claims so as to advocate the firing of IT managers, why not investigate them? One way is to measure whether or not they're true by testing filters on old & new data. Do the products you tout really do better on new mail than the old? Do they do better than other methods that aren't so sensitive? If I install a product that is so sensitive, what assurance do I have that it'll still work in 6 months?
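A comparison of a filter on old and new data could be sketched roughly as follows. This is only an outline under stated assumptions: `classify` stands in for whatever filter is under test, each corpus is a list of hand-labeled (text, is_spam) pairs, and the function names are hypothetical rather than part of any TREC tooling.

```python
def evaluate(classify, messages):
    """Return (catch_rate, false_positive_rate) for a filter on labeled mail.

    classify(text) returns True when the filter flags the message as spam.
    messages is a list of (text, is_spam) pairs with trusted labels.
    """
    spam = [text for text, is_spam in messages if is_spam]
    ham = [text for text, is_spam in messages if not is_spam]
    caught = sum(1 for text in spam if classify(text))
    false_pos = sum(1 for text in ham if classify(text))
    catch_rate = caught / len(spam) if spam else 0.0
    fp_rate = false_pos / len(ham) if ham else 0.0
    return catch_rate, fp_rate


def compare_old_and_new(classify, old_corpus, new_corpus):
    """Run the same filter on both corpora so the rates can be set side by side."""
    return {
        "old": evaluate(classify, old_corpus),
        "new": evaluate(classify, new_corpus),
    }
```

Running several filters through the same comparison would show whether any of them degrades on one data set relative to the other, which is the question being asked here.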

What I'd like to see is some positive evidence as to why your alternative - in-house live testing for a short period under unspecified conditions - is so obviously worthy while lab testing is not. Further aspersions on lab testing don't really do that.

Submitted by Mike Rothman on Fri, 2006-03-10 15:13.

It's funny that you think I moderated my tone. I don't. I still think Ferris is wrong. My view remains that end users should not be using stale corpuses to test anti-spam products. I spent 15 months at an anti-spam vendor and during that time we did roughly 400 evaluations, most of which were between 15 and 30 days. And I can tell you from real world experience that it's plenty of time to figure out whether the product will work in a specific environment.

The interesting thing about academia is that it's all about how right you are, not just that you are right. Many feel they need to be the MOST right. And they'll spend months and months testing to assemble the data to prove it. And that's fine, but that's not for me. My driver's ed teacher said it best "you may be right, but you'll still be dead." Meaning it doesn't matter who is right and who is wrong, stay focused on survival.

At the end of the day, that is really my point. For some, a lab testing situation may be sufficient. I happen to disagree with that, but ultimately if an end user is comfortable making a buying decision for an anti-spam product based upon a lab result, who am I to argue with that? But shame on me, if I don't point out the issues that I see with that approach.

In the real world, by the time you prove anything with certainty, it's too late. This business moves too fast to have absolute certainty. End users need to make decisions based upon incomplete information every single day. I view the role of the analyst as providing varying viewpoints, so the end user can make educated and balanced decisions, NOT exhausting every path of doubt or proving every position with certainty. I certainly don't claim to be right all the time, but I do always have an opinion.

Ultimately the customer is right: they either believe what I'm saying and act accordingly or they don't. And they get to live with the consequences of that decision.

Yes, every one of my positions could be exhaustively tested. But I have no interest nor do I feel a compelling need to do that. Just because something can be tested doesn't mean it should be. I know I'm right because I speak from experience. I don't need a lab report to tell me that. Maybe you do, but I don't.

I used to deal with this kind of skepticism all the time. Being an analyst in my mid-20's, I would walk into a room of grizzled networking and security veterans who were just waiting to pick me apart. You don't have a lab, you aren't hands on, how can you possibly know anything? That was the common refrain. Then I'd start talking and they'd all shut up and listen. Why? Not because I'm smarter or more experienced than them, clearly I wasn't. But because I SPOKE TO MORE PEOPLE THAN THEY DID. That was my job. Their job was to get things done.

You don't need to have a lab if you speak to people with a lab. I very rarely take hardcore technical positions anymore for that very reason. Unless I know I'm right, it's stupid to engage with technical people. But in this case I am very comfortable with my conclusions. That's because in this specific sector, I have more relevant experience than any other analyst out there. Period. Yes, including Ferris.

Now you may not think analysts hold much water and may continue to be skeptical as to the value of what we do. I don't have an issue with that; everyone is entitled to their opinion. But a lot of people pay a lot of analysts a lot of money to get access to those opinions. Being the dyed-in-the-wool capitalist that I am, I just don't believe this business model would still be around if someone wasn't getting some type of value.

 
