Thursday, July 1, 2010

Google Offers OCR for Incoming Docs

Now when you import files into Google Docs (JPEG, GIF, PNG, or PDF) you have the option of running optical character recognition on them. This is really huge: instead of just static files, you’ll be able to upload sets of words with which you can do further work. There appear to be some limits on what you can upload/convert (more about that shortly), but I find this really exciting.

When you open up Google Docs and choose Upload, you’ll get a screen to select the file you want to upload with an option for OCR like the one you see here. I didn’t have anything to OCR handy, so I went to Google Books and grabbed Analytical Psychology by Carl Gustav Jung. I set it to upload (it’s about 9MB) and Google Docs chugged away on it for several moments. After a while, Google Docs told me “Unable to Convert Document.” Well, phooey. So I went back to Google Books and tried again, this time with Damon Runyon’s Rhymes of the Firing Line. That one was a lot smaller, a little under 2MB.

That one uploaded fine, but it only got the disclaimer from Google Books and the title page, because apparently there’s a limit to how much of a PDF document Google will OCR. >facepalm< For my third try I used something from The Rotarian. (If you do a search on Google Books for the word psychedelic in magazine content available in full-text, this is the earliest result.) Google Docs processed it very quickly, but apparently didn’t like the two-column format, as the OCR was very poor. Here’s a sample:

Thc organization of power in competitive national units has reached iu; logical conclusion in the confron-lation of two grcat uppnscd blocs immobilized in thc grip of the cold war. Advance in thc tischnical of weaponry has given us weapons so powc rful than they cannon-we hope-be used: meanwhile na-lions are spcnding so much on amwmcnls that there is not enough lo mccl more than a fraction of other und more important psychosocial needs, Increasing emphasis on material products has lcd to wasteful ovcrcxploilatiun of Nature and a tllrcnlcncd shortage of natural rcsollrccs.

I guess what I’m getting at is that the Google Docs OCR is, as it stands, a bit on the fickle side; there are some size limits, and apparently some layouts work better than others. But I am still excited about this. If it evolves to be a little less finicky and have fewer limits, it’ll be an incredibly powerful tool for organizing PDF content.