Macro Express Forums

Looking for a simple OCR solution



There are a multitude of OCR and PDF apps out there in the world and I was wondering if any of you have had experience with any and could make a recommendation.

 

My server will receive dropped PDF files that contain text reports. I need to parse that data and bump it into a database. The problem is that the PDFs contain 300 DPI raster images of text, i.e. not 'selectable' text. The PDFs are programmatically generated, i.e. not scanned, so the image quality is perfect. I can do everything with MEP except the OCR. Ideally I would like a free command line program that would simply take said PDF and output an ASCII text file. I imagine a command line like c:\magicOCR.exe "SourceFile.pdf" "TextOut.TXT".
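
A minimal sketch of the wrapper side of that workflow, assuming a converter with exactly the command line imagined above. The program c:\magicOCR.exe is hypothetical (it is only the placeholder name from this post), and the helper names build_ocr_command and pending_pdfs are made up for illustration. The sketch only builds the argument list and finds PDFs that have no text output yet. It does not do any OCR itself.

```python
from pathlib import Path

# Hypothetical converter path, taken from the placeholder command in the post.
# No such program is known to exist; substitute whatever OCR tool is chosen.
OCR_EXE = r"c:\magicOCR.exe"


def build_ocr_command(pdf_path, out_dir):
    """Return the argument list for one PDF: exe, source PDF, text output."""
    pdf = Path(pdf_path)
    txt = Path(out_dir) / (pdf.stem + ".txt")
    return [OCR_EXE, str(pdf), str(txt)]


def pending_pdfs(drop_dir, out_dir):
    """List dropped PDFs that do not yet have a matching .txt file."""
    out = Path(out_dir)
    return sorted(
        p for p in Path(drop_dir).glob("*.pdf")
        if not (out / (p.stem + ".txt")).exists()
    )
```

Each pending file could then be handed to the converter with something like subprocess.run(build_ocr_command(pdf, out_dir), check=True), or the same command line could be launched from an MEP macro instead.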

 

As it stands, my fallback is OmniPage, but since this would run on the server I'd like to avoid both the cost and the footprint there. Suggestions?


I think you're using the wrong approach! Rather than using OCR (the very best of which will still give you errors), why not use a PDF to text converter? There are several such pieces of software available, which you can find using Google.


I never would have thought of that as I think it's impossible. How would any program convert a raster image to text without OCR? Could you give me an example? The only ones I see simply extract existing text data from PDFs. And I think I can do that with VBS.


That's good sideways thinking, but I'd rather handle this locally instead of uploading the PDF to an online service for processing. Also, this data is financial, so I would have to jump through hoops to validate the service as being secure for this use.


You probably want something that can monitor a folder and automatically run a PDF batch on files as they arrive.

 

If this is serious work that you are prepared to pay for, rather than looking for shareware or even a freebie, then OmniPage can probably do what you want.

 

It may be more sophisticated than you need, but anything less capable may not handle folder monitoring.

 

Nuance has deals on at the moment and offers 30-day money back.

 

OmniPage OCR Software by Nuance - Optical Character Recognition Software - Document Imaging Solutions

 

I use OmniPage, but not for anything heavyweight like your application.

 

They have a dual pricing scheme with a much reduced upgrade price. Unlike others they do not seem to police this in any way that forces you to actually own an old version. (You should check this, of course.) If you have qualms, I bet you can find a legitimate early version at a knock-down price somewhere.


OmniPage is my likely fallback. Thanks for the info, but the client already owns a copy that I can move to the server. I don't need something that can monitor a folder, as I can use MEP for that. But if I do go with OmniPage I might use its batch processor.


This seems like a rather roundabout way to solve the problem. Do you know who creates the PDFs? Can you influence the workflow?

Yeah, it does seem a long way around, doesn't it? I thought of that straight off but it's not an option. These are being generated by a big payroll company that has a competing service, so they are purposely putting the reports into a format that is difficult to automate.


  • 2 weeks later...

Ugh, intentional obfuscation.

 

Are the documents in image format, or do they merely have DRM restrictions that limit copying/editing/printing? If it's the latter, you may be able to get past the DRM by using a non-Adobe PDF reader:

 

http://www.whoismadhur.com/2009/04/03/5-free-alternatives-to-adobe-reader/

 

For more advanced DRM removal and password recovery, there's:

Advanced PDF Password Recovery from Elcomsoft

 

 

 


That sounds almost exactly like what I was looking for. [sound of other shoe dropping] Do you have any suggestions for a good PDF to TIFF command line converter?
