3rd March 2005

A short monograph on the theme of blog comment spam

I mentioned recently that I’ve been experimenting with ways to trap comment spam, the scourge of bloggers everywhere. This will get a bit technical later, and I’ll lose a lot of you, so let’s start with my conclusion: it’s easy. If you’re plagued by comment spam, it can be prevented – all of it (okay… most of it) – with almost zero effort. And I’m going to tell you how.

Personally, since I moved from MovableType to WordPress, I haven’t been getting a great deal of spam. Nevertheless, it was a minor annoyance, so I decided to implement one or two snares to filter it out. By ‘one or two snares’ I mean a whole minefield of, er, mines, which would explode under the feet of any spammer who tried to cross. And to satisfy my curiosity, I set it up to email me each time a spam comment fell foul of one of my explosives so I could see which tricks were proving the most effective. But like I say, I wasn’t getting much spam, so the results wouldn’t have much statistical significance.

This is where serendipity played its part, for the next day, Mort mentioned to me that she and MM were receiving a lot of comment spam. Naturally, I implemented the same tricks on their blogs as I had on mine, with the rejected comments arriving in my email (as well as the unrejected ones, so I could make sure nothing was getting through that shouldn’t). Suddenly my dataset had increased manyfold, so I decided to collect data for the duration of February and then write a short monograph on the subject. This is it. If you’re not interested in such matters, this will be boring. Don’t read it. Really.

As luck would have it, Mort, Mort’s Mom and I all use WordPress for our blogs. This was handy because it meant I could use the exact same code for all of them – the only downside is that they perhaps aren’t as representative of all blogs everywhere as they could be, though the techniques I used would work for any blogging system, and I imagine the results would be much the same. But I suppose you’re all screaming to hear just what these cunning things I did are. You are, right? Yes, I thought you were.

There are, in fact, all sorts of things you can do to prevent comment spam. My solution could justifiably be regarded as overkill, but there are hundreds – nay, thousands – of other tricks I could have employed if I wanted to. Some of the most popular, such as kitten’s spaminator, approach the problem by analysing the comments once they’ve been submitted and throwing away those that look like spam based on various telling signs. That’s a good approach, and I’ve found kitten’s spaminator to be very effective, but I wanted to see if I could catch them earlier than that, preventing spam from being submitted in the first place. There are well-known ways of doing that too, such as CAPTCHAs, those annoying images with numbers in them that you have to type in a box, often so distorted to prevent computers from reading them that even humans have trouble. CAPTCHAs are far from ideal – besides being an annoyance to users, they have serious accessibility issues (you’re stuffed if you’re blind), and they can be gotten around if you’re determined enough. The basic idea of CAPTCHAs, though, is a good one. It’s a Turing test (sort of) – you present something which is easy for humans but hard for computers. The holy grail of spam prevention is a Turing test that’s so easy for humans, they don’t even realise they’re doing one. I haven’t come up with a way of achieving that yet.

So, there are lots of ways to stymie the spambots. Why, then, am I about to tell you my way? Wouldn’t it be better if I encouraged you to go off and come up with a technique of your own? Surely if everyone used a different method, it would be harder for the spammers to get round them all? That is true, but I’m not sure that’s necessarily a good thing. I want the spammers to get round my traps. When they do, I’ll add some more. It’s an arms race, and it’s in the interest of those of us who despise spam that the race moves forward as quickly as possible, because we’re guaranteed to win it. We have two big advantages over the spammers. 1) It’s very hard to write a program that can pass a Turing test, but very easy to make a Turing test; and 2) no matter how smart they get, it’s simply impossible to make spam comments indistinguishable from real comments because, when it comes right down to it, there is a difference. If there weren’t, they wouldn’t be spam. It might be that when the difference becomes subtle enough, only advanced AI techniques are able to detect it, and perhaps if the arms race goes too quickly we’ll reach that point before such techniques exist, but I don’t think that will be a problem. My message to the spammers, then, is a simple one: Bring! It! On!

Let’s get down to the details. For those of you not of a technical bent, this would be a good time to go and put the kettle on. Alternatively, here are some pictures of kittens. In fact, unless you’re a codey type person with an unhealthy interest in HTML, I seriously advise you not to read on. Go and look at the kittens instead.

Right, those of you who are still with me, the first thing we do is eliminate all trackback spam by turning off trackbacks and deleting the file that handles them (wp-trackback.php in WordPress). If you like trackbacks then you might not want to go with such a drastic solution, in which case you’re on your own. I just find them annoying.

To detect proper comment spam, I did the following (if you care about the details of these, look at my source code. But don’t look too closely, I know it’s horrific. I didn’t know what the hell I was doing when I wrote this site.):

1. Renamed the page that handles form submissions to stymie any bots that just assume it’s in the default location.

2. Preceded the form where you enter comments with two dummy forms – an empty one (for really stupid bots) followed by one that looks identical to the real one. Both these forms submit their info to the wrong page. They’re hidden from real people using the magic of CSS.

3. Did the same thing after the form, in reverse order, in case any bots start at the bottom of the page and work their way up.

4. Added a hidden field to the form which gets sent along with the other stuff. When a comment’s submitted, the handler checks that this field has been sent, and that it has the correct value. The value is based on the current date, so it changes every day. To get this far, then, bots would have to parse the HTML to locate the correct form (and not be thrown off by the dummy forms surrounding it), and extract the names and values of all the fields. But – and here’s the evil part – the value of the hidden field in the HTML is wrong. JavaScript replaces it with the correct value after the page has loaded.

5. Turned the Submit button into an image, which means the x-y position where it was clicked is logged. If no x-y position is given, we take that to mean it was submitted by a bot. This is flawed because a real person can tab to the Submit button and press return, submitting the form without actually clicking. This happened once and a legitimate comment was rejected, so I switched off this test but continued to monitor it. It turned out that this trap never caught any spam.

6. Logged the number of keypresses made when entering comments. Any comment with fewer than two keypresses is rejected.
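For the codey types, here is a minimal sketch of what the server-side checks behind numbers 4 and 6 might look like. It is not the code from this site: it is written for Node.js rather than WordPress’s PHP, and the names `SECRET`, `dailyToken`, `checkSubmission` and the field names `token` and `keypresses` are all made up for illustration. The post only says the hidden value is “based on the current date”; using an HMAC of the date is an assumption, and so is accepting yesterday’s value so that a page loaded just before midnight still works.

```javascript
const crypto = require("crypto");

// Hypothetical site-specific secret (an assumption; the post mentions none).
const SECRET = "change-me";

// Token derived from the UTC calendar date, so it changes once a day.
function dailyToken(date, secret = SECRET) {
  const day = date.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return crypto
    .createHmac("sha256", secret)
    .update(day)
    .digest("hex")
    .slice(0, 16);
}

// Returns the name of the first trap the submission trips,
// or null if it gets past all of them.
function checkSubmission(fields, now = new Date()) {
  // Trap: the hidden field was never sent.
  if (!("token" in fields)) return "missing-token";

  // Trap: the hidden field still holds the decoy value from the raw HTML.
  // Yesterday's token is also accepted (an assumption, see above).
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  const accepted = [dailyToken(now), dailyToken(yesterday)];
  if (!accepted.includes(fields.token)) return "wrong-token";

  // Trap: fewer than two keypresses logged. A missing or garbled
  // count parses to NaN and is rejected as well.
  const keypresses = Number.parseInt(fields.keypresses, 10);
  if (!(keypresses >= 2)) return "too-few-keypresses";

  return null;
}
```

In a setup like this, the page would ship with a wrong value in the hidden field and a script would write the `dailyToken` value in after loading, so a bot that only reads the HTML trips the wrong-token trap.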

In total, 2079 spam comments were left and 287 genuine comments. All spams were caught, with one false positive (caused by number 5 in the list, which didn’t catch any real spam, so can be disabled with no negative impact). Which traps caught the most spam varied between the three blogs, which isn’t surprising because presumably they’re all in the databases of different spambots.

So which of these methods was successful at snaring spam? Ooh, let’s have some statistics!

Number of spams sent to the default WordPress comment handling page (which nothing in the HTML mentions):
My blog: 0
Mort’s blog: 186
Mort’s Mom’s blog: 407

Number of spams sent to the page which the skeletal dummy form above all the other forms points at:
My blog: 55
Mort’s blog: 0
Mort’s Mom’s blog: 0

Number of spams sent to the page which the not-so-skeletal dummy form immediately above the real form points at:
My blog: 0
Mort’s blog: 825
Mort’s Mom’s blog: 551

Number of spams sent to the page which the not-so-skeletal dummy form below the real form points at:
My blog: 9
Mort’s blog: 0
Mort’s Mom’s blog: 39

Number of spams sent to the page which the skeletal dummy form at the bottom points at:

Number of spams sent to the correct form but without a value for the hidden field:

Number of spams sent to the correct form with an incorrect value for the hidden field:

Number of comments left with no x-y value for the submit button:
My blog: 2 (both legitimate comments – see number 5 above)
Mort’s blog: 1 (legitimate comment – ditto)
Mort’s Mom’s blog: 0

Number of spams sent to the correct form, with zero keypresses logged in the comment field:
My blog: 3
Mort’s blog: 0
Mort’s Mom’s blog: 0

Number of spams sent to the correct form, with one keypress logged in the comment field:
My blog: 1
Mort’s blog: 0
Mort’s Mom’s blog: 0

Those last two categories are particularly interesting – I can only attribute them to seriously desperate spammers who don’t have software to do their job, and actually did it manually, pasting in the comment from the clipboard (no keypresses if they used the context menu, one if they did Ctrl+V). Considering the number of blogs you must need to spam before seeing any benefit, these people have really got their work cut out.
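The zero-versus-one pattern is easy to reproduce with a small counter. What follows is a rough sketch rather than the script this site actually uses: it listens for `keydown` and ignores modifier keys pressed on their own, so a paste from the context menu counts as zero and Ctrl+V counts as one. The function name `createKeypressCounter` and the element ids `comment` and `keypresses` are assumptions made for the example.

```javascript
// Keys that only modify other keys. Holding Ctrl for Ctrl+V should not
// be counted as a keypress in its own right.
const MODIFIER_KEYS = new Set(["Control", "Shift", "Alt", "Meta"]);

function createKeypressCounter() {
  let count = 0;
  return {
    // Call this with each keydown event from the comment field.
    handle(event) {
      if (!MODIFIER_KEYS.has(event.key)) count += 1;
    },
    // Current number of counted keypresses.
    count: () => count,
  };
}

// Browser wiring, skipped when there is no DOM (for example under Node).
if (typeof document !== "undefined") {
  const field = document.getElementById("comment");     // hypothetical id
  const hidden = document.getElementById("keypresses"); // hypothetical id
  if (field && hidden) {
    const counter = createKeypressCounter();
    field.addEventListener("keydown", (event) => {
      counter.handle(event);
      // Mirror the running total into a hidden field sent with the form.
      hidden.value = String(counter.count());
    });
  }
}
```

Anyone who types even a two-letter comment clears the threshold, which is why the cut-off sits at two.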

In conclusion, then: trapping comment spam is easy. Renaming the page that handles comments (wp-comments-post.php in WordPress) and the bit in the comments form that references it (wp-comments.php), and sandwiching this between identical, hidden forms which point to pages that don’t exist will catch all spam save that entered manually by the truly desperate. These can be detected with a bit of JavaScript that counts keypresses. Do all that, and comment spam will be a thing of the past – at least until the spammers update their software accordingly, at which point it’s time for the next round. Bring! It! On!
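For anyone who wants a starting point, the sandwich of forms described above could be generated along these lines. This is a sketch under stated assumptions, not this site’s markup: the function names, the `trap` class, the field names and the handler paths are all invented for the example, and the decoys are assumed to be hidden by a stylesheet rule such as `.trap { display: none; }`.

```javascript
// Escape text for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render one form. Decoys carry a class that the stylesheet hides;
// "empty" decoys have no fields at all (for the really stupid bots).
function renderForm(action, { decoy = false, empty = false } = {}) {
  const cls = decoy ? ' class="trap"' : "";
  const body = empty
    ? ""
    : '<input type="text" name="author">' +
      '<textarea name="comment"></textarea>' +
      '<input type="submit" value="Submit">';
  return `<form action="${escapeAttr(action)}" method="post"${cls}>${body}</form>`;
}

// Real form in the middle, a full decoy either side of it,
// and an empty decoy at the very top and very bottom.
function renderSandwich(realAction, traps) {
  return [
    renderForm(traps.topEmpty, { decoy: true, empty: true }),
    renderForm(traps.above, { decoy: true }),
    renderForm(realAction),
    renderForm(traps.below, { decoy: true }),
    renderForm(traps.bottomEmpty, { decoy: true, empty: true }),
  ].join("\n");
}

// Usage with made-up paths; none of the trap pages needs to exist.
const html = renderSandwich("/renamed-comment-handler.php", {
  topEmpty: "/trap-a.php",
  above: "/trap-b.php",
  below: "/trap-c.php",
  bottomEmpty: "/trap-d.php",
});
```

One caveat with this particular sketch: a class on the decoys is itself a tell that a smarter bot could learn, so hiding them by some less obvious route (a wrapper element, say) would keep them closer to identical to the real form.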