Two Birds With One Stone
I wrote a post a while back about Web 3.0, and how in my opinion the term could be defined as "the emergence of inter-related services", if I may paraphrase myself. Partially in defense of my previous post, and partially out of sheer admiration, I wanted to post about the reCAPTCHA project being run by Carnegie Mellon University.
First of all, if you aren't sure what a CAPTCHA is, it is the (somewhat annoying) graphic that appears in some web forms and forces you to type in a group of characters to prove that you are human. The idea (originally) was that if you can correctly enter the words displayed in an image (which robots supposedly cannot read), then you are a human and your contribution, be it a comment, a forum post, or something like that, should be allowed. The problem is that spammers catch on quickly, and OCR (Optical Character Recognition) software has come a long way. Therein lies the problem, or problems, that the folks at Carnegie Mellon have gracefully begun to tackle with reCAPTCHA.
The first problem is that of spam on the internet. CAPTCHAs have their place, and remain a fairly good way of determining whether a contributor is human or not. The second problem is related to a weakness in CAPTCHAs, which is that OCR software can be used to crack them. This is why many of the CAPTCHA challenges that you see are distorted, whether with background noise or with some sort of physical distortion. The idea is that if you distort text, OCR software cannot read it, but a human still can. Seems reasonable enough. But here is where the real innovation comes in.
Let's skip for a minute to another Internet-related project: the scanning of pre-digital-age texts. To digitize older texts, they need to be scanned and then run through OCR software to convert the scanned image (bitmap) into machine-readable text. Well, the people running this project have run into the same problem that many spammers have run into: OCR programs cannot correctly read distorted text, such as what might get scanned in near the binding of an older book.
Enter the reCAPTCHA project. During the process of scanning literature in order to digitize it, the OCR programs encounter a number of words that they simply cannot decipher correctly. A human needs to decipher those words, which is also the fundamental idea behind a CAPTCHA challenge: force a human to decipher text that OCR cannot. Killing two birds with one stone is exactly what they have conceived: every time you use reCAPTCHA, the data you enter to prove that you are human is also used to decipher scanned literature.
Of course, it is much more complicated than that, mostly because, as I said before, spammers are smart. To be sure of an answer, each unknown word is cycled through multiple times and paired with a word whose value is already known. The idea is that they gain a level of confidence in each transcription over time and with repetition. To be honest, I wish I could explain it as well as they can, but going to the official reCAPTCHA website is probably the best way to learn more.
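For the curious, here is a rough sketch in Python of how I imagine the "known word plus unknown word" trick working. To be clear, this is my own guess at the idea and not the actual reCAPTCHA code: the `WordPool` class, the `submit` method, and the agreement threshold of 3 are all names and numbers I made up for illustration.

```python
from collections import Counter, defaultdict

# Assumption: how many matching human answers we want before trusting
# a transcription. The real project's threshold is unknown to me.
REQUIRED_AGREEMENT = 3


class WordPool:
    """Hypothetical store of human guesses for words the OCR could not read."""

    def __init__(self):
        # unknown word id -> tally of the answers humans have typed for it
        self.votes = defaultdict(Counter)
        # unknown word id -> transcription we are now confident about
        self.resolved = {}

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Check one challenge; return True if the user passes as human."""
        # The control word is the one whose value is already known.
        if control_answer.strip().lower() != control_truth.strip().lower():
            # Failed the known word: reject, and ignore the other guess.
            return False

        # Passed the known word, so count the guess for the unknown word.
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1

        # Once enough people agree on the same answer, accept it.
        top_guess, count = self.votes[unknown_id].most_common(1)[0]
        if count >= REQUIRED_AGREEMENT:
            self.resolved[unknown_id] = top_guess
        return True
```

In this sketch a spammer who gets the known word wrong never gets to vote on the unknown one, and a single person typing nonsense for the unknown word cannot settle it alone, since it takes several matching answers before anything lands in `resolved`.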
