Text Extractor Yanks Aching PDF Problem

By John P. Mello Jr. MacNewsWorld ECT News Network
Oct 1, 2013 5:00 AM PT

Text Extractor - Extract text from PDF & Image with OCR by Gerald Ni of Lighten Software is available at the Mac App Store for US$12.99.

Nobody likes shuffling through paper documents to do research on anything. Even if you have a big desk, its surface can get cluttered quickly when dealing with the byproduct of dead trees.

That's why PDFs are so alluring. Adobe's cross-platform file format provides a convenient way to turn paper documents into more manageable digital ones.

There's a hangup, however. Some PDFs are image files. As such, they're more like picture files than word processing documents. That can be very annoying, especially when trying to cut and paste quotes from the PDF document.

Image PDF files aren't a total loss for researchers, however, if they have Optical Character Recognition software and a scanner. They can print a paper copy of the document and scan it into their computer. Depending on the features of their OCR software, they may be able to skip the paper and scanner steps and suck the PDF file directly into the OCR app for conversion into text.

Both scanners and OCR software can put a dent in a researcher's wallet. Commercial OCR programs typically cost more than $70 and low-end scanners sell at about the same price point. So an app like Text Extractor, which sells at the Mac App Store for $12.99, can be a real bargain for researchers who need an occasional PDF image file turned into text.

Surprising Accuracy

One problem with low-priced OCR software is that it usually isn't very accurate. If OCR isn't accurate, it will be more trouble than it's worth -- unless you're a really bad typist.

Gerald Ni of Lighten Software, the developer of Text Extractor, claims 90 percent accuracy for his app when working with a high-quality source file. If anything, he might be underestimating the accuracy of his app. In my testing, its accuracy was as good as that of its much higher-priced competitors.

Text Extractor does just that: It extracts text from an image PDF. If you want to create a duplicate text PDF -- one laid out exactly as the original and with the same fonts -- you'll need one of those higher-priced OCR programs. If you just want the image PDF turned into a text file, then Text Extractor is your app.

As with all OCR programs, Text Extractor will stumble from time to time. For example, when I scanned a grand jury indictment -- online court documents are notorious for being image rather than text PDFs -- the first page was marred with stamps. There was a filed-with-the-court stamp with handwritten initials, an ORIGINAL stamp, and the case number stamps.

The text in those stamps is of poor quality, and they appear askew on the page. Even the best OCR software would have a problem with that kind of copy. What's surprising is not that Text Extractor flubbed the text in the stamps but that it reliably interpreted the good text around the stamps. That makes it easy to strip out the junk characters created by the stamps.

Ligature Challenged

Text Extractor's OCR was tripped up by a few typographic curveballs. One example is ligatures, which are common in the fonts used in court documents.

Ligatures are used in typesetting to join certain letter pairs so that a line of type looks tighter. The letters "fi," for instance, will appear as a single character with no space between the two letters.

That f-i ligature was used several times in the document I brought into Text Extractor, which replaced it with nonsense characters.

Ligatures aren't just a problem with OCR programs, though. I've cut passages from text PDFs that contain ligatures that were turned into nonsense when pasted into a word processing document because the font I was using didn't support them.
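
For readers who run into this with pasted text, here is a minimal Python sketch of one way to clean it up. It assumes the ligatures survived as proper Unicode ligature characters (the block from U+FB00 to U+FB06), which is not the case when OCR has already replaced them with nonsense. The function name expand_ligatures is made up for this illustration and has nothing to do with Text Extractor.

```python
import unicodedata

# Unicode code points for the Latin ligatures ff, fi, fl, ffi, ffl, long s-t, st.
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def expand_ligatures(text: str) -> str:
    """Replace Latin ligature characters with their plain-letter equivalents.

    Only characters in LIGATURE_RANGE are normalized, so accented letters
    and other symbols in the text are left exactly as they were.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )
```

Calling expand_ligatures on a word pasted with the single-character f-i ligature returns the same word spelled with a separate "f" and "i", which any font can display.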

I also found Text Extractor fumbling on smart quotes. That's a somewhat more serious problem than being unable to handle arcane typographic characters, because smart quotes are more likely to be found in a document than ligatures.
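
When smart quotes do come through intact in extracted text, they can be swapped for plain ones with a small translation table. The Python sketch below is an illustration only: the function name straighten_quotes is hypothetical, and it cannot repair quotes that the OCR has already misread as other characters.

```python
# Map the four common curly quote characters to their straight equivalents.
SMART_QUOTES = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark (also used as an apostrophe)
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(SMART_QUOTES)
```

A table-based approach keeps the substitution explicit, so it is easy to add other characters, such as dashes, if a particular document needs them.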

Easy to Use

Text Extractor complements its high-quality OCR with a nice simple interface. To bring a document into the program, you click a large Open File button at the top of the screen. A Finder window will open, and you can search for the file you want to bring into the app.

Then you can slide the OCR control to On and choose a language for scanning. In addition to English, nine languages are supported: Dutch, French, German, Italian, Polish, Portuguese, Russian, Spanish and Swedish.

Beside the language menu, there's a dropdown menu for choosing whether to scan all the pages in the document or the current page.

You can also select portions of a document for OCR treatment or exclude portions of a document from scanning. That's useful for excluding tabular material or charts in a document. Text Extractor does not convert tables into text very well.

After setting up your document for scanning, you click the large Extract button on the right of the screen and Text Extractor will start cooking. The program converts documents into text at a good clip, but results will vary by the size of the document and the muscle in your Mac.

Converted text will appear in a pane beside the original document for easy comparison. Once converted, you can cut and paste the text into other documents or export the whole converted document to a text file.

Text Extractor is a dandy program for researchers with an occasional need to turn image PDFs into text and a desire to avoid the expense of hardware scanners and elaborate OCR programs.

Want to Suggest a Mac App for Review?

Is there a Mac app you'd like to suggest for review? Something you think other Mac users would love to know about? Something you find intriguing but are hesitant to buy?

Please send your ideas to me, and I'll consider them for a future Mac app review.

John Mello is a freelance technology writer and former special correspondent for Government Security News.
