Wednesday, May 31, 2006

JPEG2000 interface to texts and illustrations

We have used LizardTech's MrSID image format since 2000 to display our herbarium specimen images. It's worked very well for us over the years, but we were always concerned that it was a proprietary format. The latest version of LizardTech's ExpressServer now supports JPEG2000 (*.jp2) images, which are based on an open standard, so we began looking for ways to create an Ajaxian interface on top of ExpressServer. Luckily, we found previous work in the form of the original GSV from Michal Migurski and GSIV from David Allen. Jay Paige, one of MBG's intrepid programmers, customized the core app to work with ExpressServer, and voilà:

View example text page

View example illustration

View specimen

We're finalizing credits and such on the code and will make it available to anyone who wants it.

Friday, May 19, 2006

Rudimentary page turning for 399 volumes

Today I published a rudimentary page turning interface for the 399 volumes we've scanned to date. That's 232,000+ pages from 23 titles, and represents over a year's worth of digitization. The interface is still VERY basic and does not match the interface I'm proposing. My main reason for doing this was to test our data model in a web environment and it's already paid off - we discovered some missing relationships that we need to incorporate into the next version of the model.

A note about the images: We are scanning materials at 400 dpi grayscale with our Indus book scanners and saving the master images as uncompressed TIFs. The images presented on the site are bitonal GIFs and will be replaced with multi-resolution, grayscale JP2 (JPEG2000) images once we settle on a vendor for encoding.

Please have a look at http://www.botanicus.org/tobescanned.asp and provide feedback!

Friday, May 12, 2006

OCR results for Species Plantarum

One of the titles we've digitized is Linnaeus' Species Plantarum, published in 1753. It's arguably the most significant title in plant taxonomy, and our copy posed some challenges because it's tightly bound and has a fair amount of bleed-through on the page. You can download a representative page (4MB TIF or 1MB JPG) to see what I mean. This page in particular is important because it's where the scientific name for corn, Zea mays, was first published.

I expected the OCR to be bad, but was shocked at how miserable it was with our standard settings! View the text to see what I mean. Terrible, eh?!

So, I tried to make it better. I used Prime's internal image cleaning routines (deskew, despeckle, noise reduction) to see if there was improvement. There was, but it still wasn't enough. Check out the file.

Finally, I went back to the original TIF (a copy, actually) and used Photoshop's Threshold adjustment to approximate a good bitonal image. Here's a lower-res GIF to view as an example (OCR was run on the full-res TIF). I had heard anecdotally that bitonal images result in more accurate OCR, and I would have to agree, at least for this one page: view the results to see the dramatic improvement. Far from perfect, but a heck of a lot better than the earlier tests!
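For anyone who wants to try the same idea without Photoshop, here is a minimal sketch of a fixed threshold in plain Python. It is not what Photoshop does internally, just the basic technique: every pixel at or above the cutoff becomes white, everything darker becomes black. The function name `to_bitonal`, the sample pixel values, and the cutoff of 100 are all illustrative assumptions.

```python
def to_bitonal(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white).

    Pixels at or above `threshold` become white; darker pixels become black.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 patch of a scanned page (values are made up):
# bright paper, mid-gray bleed-through, and dark ink.
page = [
    [250, 240, 90],   # paper, paper, ink
    [200, 140, 30],   # paper, bleed-through, ink
]

# A cutoff below the bleed-through value (140) pushes it to white,
# while the ink (90 and 30) stays black.
print(to_bitonal(page, threshold=100))
# [[255, 255, 0], [255, 255, 0]]
```

Choosing the cutoff is the hard part. Set it too high and bleed-through turns black along with the ink; set it too low and faint strokes disappear.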

Martin Kalfatovic at Smithsonian kindly offered to run the TIF through LuraTech's software, which uses ABBYY as the OCR engine. The results on the unedited TIF were actually pretty good (certainly much better than the results from Prime). View the file.

What this says to me is that we should evaluate the standard OCR results, and if they're bad (how to judge?), maybe make derivatives just for the purpose of OCR. Or, switch to ABBYY!
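On the "how to judge?" question, one possible heuristic (a sketch only, not something we have in production) is to measure what fraction of the OCR tokens look like plausible words. The function names, the punctuation list, and the 0.8 cutoff below are assumptions chosen for illustration, and a real check would likely need tuning for Latin abbreviations and period typography.

```python
import re

# A token is "plausible" if, after trimming surrounding punctuation,
# it is either two or more letters (any alphabet) or a run of digits.
_PLAUSIBLE = re.compile(r"^(?:[^\W\d_]{2,}|\d+)$")


def plausible_word_ratio(text):
    """Return the share of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for token in tokens:
        core = token.strip(".,;:!?()[]\"'")
        if _PLAUSIBLE.match(core):
            good += 1
    return good / len(tokens)


def needs_rework(text, min_ratio=0.8):
    """Flag a page whose OCR output falls below an assumed quality cutoff."""
    return plausible_word_ratio(text) < min_ratio


print(plausible_word_ratio("Zea mays, the corn plant."))  # 1.0
print(needs_rework("Z~a m^ys |l 1;i"))                    # True
```

Pages that get flagged could then be routed to a bitonal derivative or a second OCR engine instead of being reviewed by hand.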

Searching the text of 300 digitized titles

We're running PrimeRecognition's PrimeOCR for text conversion. I've spent several months working out kinks and learning about how OCR really works (or doesn't work) with our historic literature. We have a cache of 83,000 text pages generated from Prime and I wondered how easy it would be to drop in existing services within our network to start interacting with this text. Turns out it was REALLY easy. We're a Windows/.NET shop, so we have several machines running IIS 6.0. I built an out-of-the-box Indexing Service implementation and incorporated it into our beta site at:

http://www.botanicus.org/search.asp

Give it a try - results are interesting!
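For readers curious about what a full-text index does under the hood, here is a toy inverted index in Python. To be clear, this is not how Indexing Service is implemented and not the code behind the search page; it is a minimal sketch of the general technique, with hypothetical page IDs and text.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page IDs that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR text keyed by made-up page IDs.
pages = {
    "p1": "Zea mays is corn",
    "p2": "Zea is a genus",
}
index = build_index(pages)
print(sorted(search(index, "zea")))       # ['p1', 'p2']
print(sorted(search(index, "zea mays")))  # ['p1']
```

Note that exact word matching like this is unforgiving of OCR errors, which is part of why the search results on uncorrected text are so interesting.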

Prototype interface for scientific literature

I had an "Aha!" moment about 2 weeks ago while reading Pragmatic Ajax by Justin Gehtland, Ben Galbraith, and Dion Almaer (highly recommended, by the way). The authors use Google Maps as a classic example of Ajaxian techniques and it struck me that the interface for Google Maps is almost identical to what we've been envisioning for zooming & panning digitized scientific literature with JPEG2000. One idea led to another and 12 hours later I had the beginnings of a web interface. I started putting ideas into PowerPoint (one of the easiest tools to use for prototyping) and came up with the following:

Download 4MB PowerPoint (zipped)
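The core of a Google Maps-style viewer is working out which fixed-size tiles overlap the part of the image currently on screen, so that only those need to be requested. Here is a minimal sketch of that calculation; the function name, the 256-pixel tile size, and the sample viewport are assumptions for illustration, not part of the prototype.

```python
def visible_tiles(view_x, view_y, view_w, view_h, tile_size=256):
    """Return (col, row) indices of tiles that overlap the viewport.

    view_x and view_y are the pixel coordinates of the viewport's top-left
    corner within the page image at the current zoom level.
    """
    first_col = view_x // tile_size
    last_col = (view_x + view_w - 1) // tile_size
    first_row = view_y // tile_size
    last_row = (view_y + view_h - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 400x300 viewport whose top-left corner sits at pixel (300, 100).
print(visible_tiles(300, 100, 400, 300))
# [(1, 0), (2, 0), (1, 1), (2, 1)]
```

Panning changes `view_x` and `view_y`, and zooming swaps in a different resolution level of the same image, so the same function covers both interactions.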