Friday, May 12, 2006

OCR results for Species Plantarum

One of the titles we've digitized is Linnaeus' Species Plantarum, published in 1753. It's arguably the most significant title in plant taxonomy, and our copy posed some challenges because it's tightly bound and has a fair amount of bleed-through on the page. You can download a representative page (4MB TIF or 1MB JPG) to see what I mean. This page in particular is important because it's where the scientific name for corn, Zea mays, was first published.

I expected the OCR to be bad, but was shocked at how miserable it was with our standard settings! View the text to see what I mean. Terrible, eh?!

So, I tried to make it better. I used Prime's internal image cleaning routines (deskew, despeckle, noise reduction) to see if there was improvement. There was, but it still wasn't enough. Check out the file.

Finally, I went back to the original TIF (a copy, actually) and used Photoshop's Threshold adjustment to approximate a good bitonal image. Here's a lower-res GIF to view as an example (OCR was on the full-res TIF). I had heard anecdotally that bitonal images result in more accurate OCR. I would have to agree, at least for this one page - view the results to see the dramatic improvement. Far from perfect, but a heck of a lot better than the earlier tests!
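For anyone who wants to try the same idea without Photoshop, here is a minimal sketch of what a threshold step does, assuming the page is available as rows of 8-bit grayscale values (0 = black, 255 = white). The function name `to_bitonal`, the cutoff of 128, and the sample pixel values are all made up for illustration; this is not the routine Photoshop or Prime actually uses, and a real page would need a cutoff chosen by eye or by experiment.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map each 8-bit grayscale pixel to pure black (0) or pure white (255).

    Pixels at or above `threshold` become white, everything darker becomes
    black. Treating the cutoff value itself as white is an arbitrary choice
    made for this sketch.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# Tiny made-up "page": dark ink (30), faint bleed-through (170), paper (230).
page = [
    [230, 170, 30],
    [30, 230, 170],
]

bitonal = to_bitonal(page, threshold=128)
print(bitonal)
```

With a cutoff between the ink and the bleed-through values, the bleed-through is pushed to white along with the paper and only the ink stays black, which is the effect the Threshold adjustment was used for here.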

Martin Kalfatovic at Smithsonian kindly offered to run the TIF through LuraTech's software, which uses ABBYY as the OCR engine. The results on the unedited TIF were actually pretty good (certainly much better than the results from Prime). View the file.

What this says to me is that we should evaluate the standard OCR results, and if they're bad (how to judge?), maybe make derivatives just for the purpose of OCR. Or, switch to ABBYY!
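On the "how to judge?" question, one rough option is a dictionary-free score: the share of whitespace-separated tokens that consist only of letters once surrounding punctuation is stripped. The sketch below is only an assumption about what such a check could look like. The name `wordlike_ratio`, the punctuation set, and the sample strings are invented for illustration, and any cutoff for "bad" would have to be calibrated against pages whose quality is already known.

```python
def wordlike_ratio(text, min_len=2):
    """Return the fraction of tokens that look like plain words.

    A token counts as word-like if, after stripping common surrounding
    punctuation, it has at least `min_len` characters and all of them
    are letters. Returns 0.0 for empty input.
    """
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_wordlike(token):
        core = token.strip(".,;:!?()[]\"'")
        return len(core) >= min_len and core.isalpha()

    good = sum(1 for token in tokens if is_wordlike(token))
    return good / len(tokens)


# Made-up samples, not actual OCR output from the page discussed here.
clean_sample = "Zea mays, sample text."
noisy_sample = "Z~a m@ys ,. t3xt"

print(wordlike_ratio(clean_sample))
print(wordlike_ratio(noisy_sample))
```

A score like this only flags obvious garbage such as stray symbols and digits inside words. It cannot tell a correctly read word from a misread one that still happens to be all letters, so it is a screening aid for deciding which pages deserve a closer look, not a measure of accuracy.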


At 8:32 PM, Blogger Peter Bostock said...

One significant problem with all OCR packages is that they rely on a dictionary to determine likely words, and there is no dictionary for (Botanical or Classical) Latin. Even though ABBYY says it has a dictionary of Latin, it is just determining the likely character set, as far as I can tell.


