Antispam Word Jumbles to Help Digitize Books

Web surfers all too familiar with the distorted-letter tests that accompany so many site registration forms today can now take heart — the time they spend on those tests is being put to good use.

Thanks to a project at Carnegie Mellon University, a new version of those pesky CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) tests makes the technology do double duty: not only does it continue to distinguish between legitimate human users and malevolent spam programs, it also uses the results to aid in the digitization of books for the Internet Archive.

A Carnegie Mellon team led by Luis von Ahn, an assistant professor of computer science and recipient of a MacArthur Foundation genius grant, developed the new tests, dubbed “reCAPTCHAs,” which were launched on Wednesday.

Helping OCR

Optical character recognition (OCR) technology used to digitize printed text is often confounded by underlined text, scribbles and fuzzy or otherwise poorly printed letters.

ReCAPTCHA tests work by asking users to type in two words: one distorted but known word, and one that has stumped an OCR system working on a digitization project. If the user inputs the known word correctly, the system has greater confidence that the user has deciphered the problematic word correctly too.

Each unknown word is submitted to multiple users; if several enter the same transcription, the system assumes it is correct.
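The two-word check and the agreement rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Carnegie Mellon team's code: the names `UnknownWord` and `check_response`, the `votes_needed` threshold and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import Counter


def normalize(text):
    """Compare answers case-insensitively and ignore stray whitespace (an assumption)."""
    return text.strip().lower()


class UnknownWord:
    """A word the OCR system could not read, plus the guesses users have typed for it."""

    def __init__(self, word_id, votes_needed=3):
        self.word_id = word_id            # hypothetical identifier for the scanned word
        self.votes_needed = votes_needed  # hypothetical agreement threshold
        self.votes = Counter()

    def record(self, guess):
        self.votes[normalize(guess)] += 1

    def accepted(self):
        """Return the agreed transcription, or None if too few users agree so far."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= self.votes_needed else None


def check_response(known_answer, known_guess, unknown_word, unknown_guess):
    """Pass the user only if the known word is right; only then count the other guess."""
    if normalize(known_guess) != normalize(known_answer):
        return False
    unknown_word.record(unknown_guess)
    return True
```

In this sketch a guess for the unreadable word is counted only when the same user got the known word right, and a transcription is accepted once `votes_needed` users have typed the same thing.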

In this way, the new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. At the same time, they contribute to book digitization projects by helping OCR systems convert printed text into computer-readable letters.

Wasting 150,000 Hours a Day

Von Ahn worked on the original CAPTCHA technology for Yahoo, and was astounded to later learn that 60 million of the tests are solved every day by people around the world. “When I first found this out, I was quite proud of myself and the impact my research has had,” von Ahn told TechNewsWorld.

“But then I started feeling bad: Each time a CAPTCHA is solved, 10 seconds of human time are basically wasted,” von Ahn explained. “If you multiply that by 60 million, you get that humanity as a whole wastes about 150,000 hours every day solving CAPTCHAs. That’s a lot of time!”
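As a rough check, the multiplication von Ahn describes can be worked through directly. The short Python sketch below uses only the two figures he quotes (60 million tests a day, 10 seconds each); the variable names are illustrative.

```python
# Figures quoted in the article
captchas_per_day = 60_000_000
seconds_per_captcha = 10

# Total human time per day, converted from seconds to hours
seconds_per_day = captchas_per_day * seconds_per_captcha
hours_per_day = seconds_per_day / 3600

print(f"{hours_per_day:,.0f} hours per day")
```

The exact product comes to roughly 167,000 hours a day, the same order of magnitude as the "about 150,000" von Ahn cites as a round estimate.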

Inspired to come up with additional ways the technology could do something useful for humanity, von Ahn then had the idea of helping to digitize books.

By Thursday night, about 300 Web sites had signed up to use the technology and 20,000 words had been digitized, von Ahn said. One of the first books being tackled is John Dewey’s Psychology, he added.

Strength in Numbers

In tapping the collective power of thousands of computer users worldwide, reCAPTCHA resembles the distributed computing project SETI@home, through which users donate their computers’ spare processing time to help analyze the enormous volumes of radio signals from space recorded by radio telescopes around the globe.

With support from Intel, von Ahn’s team has developed a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the technology to protect their own e-mail addresses.

‘The Spirit of Web 2.0’

“ReCAPTCHA is a brilliant idea and implementation,” Jason Dowdell, operator of media and technology blog MarketingShift, told TechNewsWorld.

“Far too many entrepreneurs have built applications that solve only one problem,” Dowdell added. “Von Ahn has built a platform that is incredibly simple at its core yet provides the opportunity to meet some very large challenges — that’s the spirit of Web 2.0.”


More by Katherine Noyes