Wednesday, May 21, 2008

Image Scanning and OCR with Ubuntu

I was going to install a SCSI card and hook up the spare HP ScanJet 3c to test out scanning. I have two of these beasts: one is installed on the old Windows server and the other is the backup. I used to have three, but that seemed excessive, so I gave one away.

Anyway, I wanted to add a second gig of RAM to the Compaq Evo W6000 that is running Ubuntu, so I grabbed an Adaptec SCSI card from the pile. But then a Canon CanoScan FB636U USB scanner came my way (courtesy of the dumpster). This is a neat little scanner, small and light, that gets its power from the USB connection, so no power supply.

I used to have a couple of early Agfa USB scanners for the Mac, but OS X had poor scanner support, so I got rid of them. So I was curious to see how Ubuntu would work with this discard.

From the menu I selected Applications, Graphics, XSane Image Scanner. Really cool: it 'scanned' for devices, found the Canon, and let me calibrate it. So I scanned an image, and here is the result.
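
For anyone who prefers a terminal: XSane is a front end to SANE, so the same device detection can probably be done from a script. Here is a minimal Python sketch, assuming the scanimage tool (from the sane-utils package on Ubuntu) is installed. The function names are my own, and the parsing assumes the usual "device ... is a ..." line format that scanimage -L prints.

```python
import re
import shutil
import subprocess

# Assumed line format of `scanimage -L`:
#   device `backend:port' is a Vendor Model flatbed scanner
DEVICE_LINE = re.compile(r"^device `(?P<name>[^']+)' is an? (?P<description>.+)$")


def parse_scanimage_list(output):
    """Return (device_name, description) pairs found in `scanimage -L` output."""
    devices = []
    for line in output.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            devices.append((match.group("name"), match.group("description")))
    return devices


def list_scanners():
    """Ask SANE for attached scanners; returns [] if scanimage is not installed."""
    if shutil.which("scanimage") is None:
        return []
    result = subprocess.run(
        ["scanimage", "-L"], capture_output=True, text=True, check=False
    )
    return parse_scanimage_list(result.stdout)
```

Calling list_scanners() should give back one entry per detected scanner, or an empty list if nothing is plugged in or SANE does not support the device.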

Not bad, eh? Looks like it supports scanning negatives and slides. We choose to save the scan to the desktop as a jpg file and click the SCAN button. Pretty easy to do and just as good as the Document Image Scanning function in Microsoft Office.

This is just a way to acquire an image; we also want to try some Optical Character Recognition. Doing OCR requires specialized software to 'scan' the image produced by the scanner and convert it into text. I have used TextBridge and Omniscan on Windows before. So we head off to the package manager to search for OCR. We decide to install Clara and also another OCR package.

We find:

Clara OCR is intended for large scale digitalization projects. It features a powerful GUI and a web interface for cooperative digitalization of books.

gocr is a multi-platform OCR (Optical Character Recognition) program. It is a command-line tool, but there is also a GTK GUI for it.

GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method.

Tesseract - A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. http://sourceforge.net/projects/tesseract-ocr

Not listed in the package manager, but available at http://code.google.com/p/ocropus/, is OCRopus, a state-of-the-art document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. It seems to use the Tesseract engine and is tested to run under Ubuntu. So we download tesseract-2.03 and extract the folder. From a quick look at the notes, this is pretty rough software: alpha.
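
For reference, the Tesseract command-line tool takes an image and an output base name, and writes the recognized text to that base name plus .txt. Below is a rough Python sketch of calling it, assuming a built tesseract binary is on the PATH. The function names and file names are placeholders of mine. As far as I can tell, the 2.x releases only read uncompressed TIFF, so the scan would need to be saved as TIFF rather than JPG.

```python
from pathlib import Path
import shutil
import subprocess


def tesseract_command(image_path, output_base):
    """Build the command line: tesseract <image> <output base name>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_with_tesseract(image_path, output_base):
    """Run Tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    # Tesseract appends ".txt" to the output base name on its own.
    text_file = Path(f"{output_base}.txt")
    return text_file.read_text(encoding="utf-8", errors="replace")
```

Something like ocr_with_tesseract("scan.tif", "scan") would then leave the text in scan.txt and also return it.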

We don't see a menu item added for Clara, which we installed. Will have to take a stab at OCR tomorrow.

We now see that in the XSane preferences menu, under the OCR tab, the OCR command box has gocr listed. What we can't see is how to run the OCR. There seems to be no control on the interface for it.
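
Since the OCR tab just names a command, gocr can presumably also be run by hand, outside XSane. Here is a minimal Python sketch, assuming gocr is on the PATH and the scan was saved in a format gocr reads directly, such as PNM. The function names are made up, and scan.pnm is a placeholder file name. With only the -i option, gocr prints the recognized text to standard output.

```python
import shutil
import subprocess


def gocr_command(image_path):
    """Build the command line: gocr -i <image>, which prints text to stdout."""
    return ["gocr", "-i", str(image_path)]


def ocr_with_gocr(image_path):
    """Run gocr on an image and return its text, or None if gocr is not installed."""
    if shutil.which("gocr") is None:
        return None
    result = subprocess.run(
        gocr_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, print(ocr_with_gocr("scan.pnm")) would dump whatever gocr makes of the page to the terminal.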

Update: hah - this scanner worked perfectly in Ubuntu when I plugged it in. So today I plugged it into Windows XP. The OS chugged a bit and eventually declared it could not install the device - a simple USB scanner. Chalk one up for Ubuntu!

5 comments:

Jaap said...

install gocr: (in ubuntu)
open terminal
type gocr.
you will see a way to install gocr.

Unknown said...

Noticed the lack of an update, but I ran into the same problem as you. Found the answer: scan the document, then use the 'ocr' save-as option for the image.

Not terribly accurate (gocr) but it didn't take me hours to redo a scanned resume at least.

Jaap's comment is pretty useless. You've already installed it if you see the OCR tab, as you mentioned.

Unknown said...

Hi!

I'd use tesseract-ocr or cuneiform; both provide much better results.

so long
hank

DomFilk said...

if you don't want to install ocr software to your computer, you may try this free online ocr tool.