Article Comment Spam
As this week wraps up, we are getting ready to roll out some changes to the website that we have been working on for a couple of months. To start with, we are upgrading to PHP5 for its performance gains and security improvements. At the same time, we are replacing the site's database abstraction layer, the PHP objects and code that interact with the database, which in the past caused performance problems during traffic spikes. One of the most taxing parts of the website for the database is the advertising mechanism, which serves and tracks ads on our site. Because it involves so much database interaction, we took the time to rewrite that section (including improvements to our database structure) so that it would not cause bottlenecks.
As it stands, by next week I don't think we will see another performance issue with the site until it becomes a hardware-related problem, such as getting so much traffic that we reach the limits of what our server can do. We'll deal with that when it comes. As for me, I will be looking at putting in a mechanism to control some of the article comment spam that we have been dealing with.
We are going to implement what is called CAPTCHA on nearly every part of the site where users can submit input, article comments in particular. We have been targeted by so many robots submitting spam as article comments, presumably in attempts to plant XSS attacks, that I would like to put a stop to it. It wastes server resources, clogs our email, and makes for unhappy mornings where someone has to go through all the spam to approve the one or two legitimate comments.
As I mentioned, the answer is called CAPTCHA, which stands (somewhat strangely) for Completely Automated Public Turing test to tell Computers and Humans Apart. This is one of those instances where the acronym is much more fun to say. In plain English, it refers to those little images in web forms that show a sequence of letters and numbers. The user must enter those letters and numbers correctly, or the system assumes the user is a spammer and rejects the submission. Our friend Michael wrote a simple little script that generates this kind of validation, and I will be putting it into the article comment form early next week. The technique itself is relatively simple.
When the page is loaded, the PHP script generates a random sequence of letters and numbers. This "code" is stored in a session variable so that it can be checked against the user's input. The same "code" is also used to generate an image (in our case a PNG) using PHP's ImageMagick library. The image is displayed on the web page next to a text field asking the user to enter the "code". When they submit the form, the value they entered is compared against the session variable, and the comment is accepted only if the two match. My hope is that it will filter out nearly all the automated spam that comes through.
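To make those steps concrete, here is a minimal sketch of how such a check could look. It is not Michael's script: the function names, the `captcha_code` session key, the six-character length, and the image dimensions are all assumptions made for illustration. It also assumes a newer PHP than the one we are moving to (`random_int()` needs PHP 7 or later) and that the Imagick extension is installed for the image step.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from the actual script.

// Characters that are easy to tell apart in an image (no 0/O or 1/I).
const CAPTCHA_ALPHABET = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';

// Build a random code of $length characters. random_int() requires PHP 7+.
function captcha_generate_code(int $length = 6): string
{
    $alphabet = CAPTCHA_ALPHABET;
    $max = strlen($alphabet) - 1;
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, $max)];
    }
    return $code;
}

// Store a fresh code in the session array and return it so it can be rendered.
function captcha_issue(array &$session): string
{
    $session['captcha_code'] = captcha_generate_code();
    return $session['captcha_code'];
}

// Compare the submitted value with the stored code (ignoring case and
// surrounding whitespace), then discard the code so it can be tried only once.
function captcha_check(array &$session, string $input): bool
{
    $expected = $session['captcha_code'] ?? null;
    unset($session['captcha_code']);
    if ($expected === null) {
        return false;
    }
    return hash_equals($expected, strtoupper(trim($input)));
}

// Render the code as PNG bytes with the Imagick extension.
// Plain undistorted text like this is only a starting point.
function captcha_render_png(string $code): string
{
    $image = new Imagick();
    $image->newImage(160, 50, new ImagickPixel('white'));
    $draw = new ImagickDraw();
    $draw->setFontSize(28);
    $draw->setFillColor(new ImagickPixel('black'));
    $image->annotateImage($draw, 15, 35, 0, $code);
    $image->setImageFormat('png');
    return $image->getImageBlob();
}
```

On a real page, the form script would call `session_start()` and then `captcha_issue($_SESSION)`, a separate image script would send a `Content-Type: image/png` header and echo `captcha_render_png()` for the stored code, and the submit handler would call `captcha_check($_SESSION, $_POST['captcha'])` before accepting the comment (the `captcha` field name is also an assumption). Discarding the code after every attempt means a robot cannot keep guessing against the same image.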