TechNewsWorld.com
Future Tech

Antispam Word Jumbles to Help Digitize Books


A Carnegie Mellon University project is using CAPTCHA -- or Completely Automated Public Turing Test to Tell Computers and Humans Apart -- tests to digitize books. Three hundred Web sites have already signed up to use the technology. About 60 million CAPTCHA tests are solved every day.



Web surfers all too familiar with the distorted-letter tests that accompany so many site registration forms today can now take heart -- the time they spend on those tests is being put to good use.

Thanks to a project at Carnegie Mellon University, a new version of those pesky CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) tests makes the technology do double duty: Not only does it continue to distinguish between legitimate human users and malevolent spam programs, it also uses the results to aid in the digitization of books for the Internet Archive.

A Carnegie Mellon team led by Luis von Ahn, an assistant professor of computer science and recipient of a MacArthur Foundation genius grant, developed the new tests, dubbed "reCAPTCHAs," which were launched on Wednesday.

Helping OCR

Optical character recognition (OCR) technology used to digitize printed text is often confounded by underlined text, scribbles and fuzzy or otherwise poorly printed letters.

ReCAPTCHA tests work by asking users to type in two words: one distorted but known, and one that has stumped an OCR system working on a digitization project. If the user enters the known word correctly, the system has greater confidence that he or she has deciphered the problematic word correctly too.

Each unknown word is submitted to multiple users; if several enter the same transcription, the system assumes it is correct.
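
The two-word check and the agreement rule described above can be sketched in a few lines of Python. This is only an illustration, not the Carnegie Mellon team's code: the function names, the case-insensitive comparison and the threshold of three matching answers are all assumptions, since the article says only that "several" users must agree.

```python
from collections import Counter, defaultdict

# Assumed value: the article says only that "several" users must agree.
AGREEMENT_THRESHOLD = 3

# Hypothetical store: unknown-word id -> tally of answers from users
# who typed the known word correctly.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively, ignoring surrounding whitespace."""
    return text.strip().lower()


def submit(known_word, known_answer, unknown_id, unknown_answer):
    """Return True if the user passes the test; record the unknown-word answer."""
    if normalize(known_answer) != normalize(known_word):
        # The known word was wrong, so the other answer is not trusted.
        return False
    votes[unknown_id][normalize(unknown_answer)] += 1
    return True


def accepted_reading(unknown_id):
    """Return the agreed reading once enough users match, otherwise None."""
    tally = votes[unknown_id]
    if not tally:
        return None
    answer, count = tally.most_common(1)[0]
    return answer if count >= AGREEMENT_THRESHOLD else None
```

In this sketch, a user who mistypes the known word fails the test and contributes nothing, while the unknown word is treated as read only after three passing users have typed the same thing.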

In this way, the new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. However, they also contribute to book digitization projects by helping OCR systems convert printed text into computer-readable letters.

Wasting 150,000 Hours a Day

Von Ahn worked on the original CAPTCHA technology for Yahoo (Nasdaq: YHOO), and was astounded to later learn that 60 million of the tests are solved every day by people around the world. "When I first found this out, I was quite proud of myself and the impact my research has had," von Ahn told TechNewsWorld.

"But then I started feeling bad: Each time a CAPTCHA is solved, 10 seconds of human time are basically wasted," von Ahn explained. "If you multiply that by 60 million, you get that humanity as a whole wastes about 150,000 hours every day solving CAPTCHAs. That's a lot of time!"
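
For readers who want to check the arithmetic, here is a quick Python calculation using the two figures von Ahn cites. The constant names are illustrative. Taken literally, 10 seconds per test works out to a little under 167,000 hours a day, the same order of magnitude as the "about 150,000" he quotes.

```python
CAPTCHAS_PER_DAY = 60_000_000   # tests solved daily, per von Ahn
SECONDS_PER_CAPTCHA = 10        # time spent on each test, per von Ahn

total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
total_hours = total_seconds / 3600  # 3,600 seconds in an hour

print(f"{total_hours:,.0f} hours per day")
```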

Inspired to come up with additional ways the technology could do something useful for humanity, von Ahn then had the idea of helping to digitize books.

By Thursday night, about 300 Web sites had signed up to use the technology and 20,000 words had been digitized, von Ahn said. One of the first books being tackled is John Dewey's Psychology, he added.

Strength in Numbers

Because it taps the collective power of thousands of computer users worldwide, reCAPTCHA resembles SETI@home, the distributed computing project through which users donate their computers' spare processing time to help analyze the enormous volumes of radio signals from space recorded by radio telescopes around the globe.

With support from Intel (Nasdaq: INTC), von Ahn's team has developed a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the technology to protect their own e-mail addresses.

'The Spirit of Web 2.0'

"ReCAPTCHA is a brilliant idea and implementation," Jason Dowdell, operator of media and technology blog MarketingShift, told TechNewsWorld.

"Far too many entrepreneurs have built applications that solve only one problem," Dowdell added. "Von Ahn has built a platform that is incredibly simple at its core yet provides the opportunity to meet some very large challenges -- that's the spirit of Web 2.0."




More by Katherine Noyes
