Cory Posted June 11, 2010
There are a multitude of OCR and PDF apps out there, and I was wondering whether any of you have experience with one and could make a recommendation. My server will receive dropped PDF files that contain text reports. I need to parse that data and load it into a database. The problem is that the PDFs contain 300 DPI raster images of text, i.e. not selectable text. The PDFs are programmatically generated, i.e. not scanned, so the image quality is perfect. I can do everything with MEP except the OCR. Ideally I would like a free command-line program that simply takes the PDF and outputs an ASCII text file. I imagine a command line like c:\magicOCR.exe "SourceFile.pdf" "TextOut.TXT". As it is now, my fallback is OmniPage, but I want to run this on the server and would like to avoid both the cost and the footprint there. Suggestions?
paul Posted June 11, 2010
I think you're using the wrong approach! Rather than using OCR (the very best of which will still give you errors), why not use a PDF to text converter? There are several such pieces of software available, which you can find using Google.
acantor Posted June 11, 2010
I am not sure whether this tool works for PDFs that are basically photos of text, but it would be worth a try, given that it is from Adobe and was meant to compensate for the inaccessibility of the PDF format for some people with disabilities: http://www.adobe.com/products/acrobat/access_onlinetools.html
Cory Posted June 11, 2010
I never would have thought of that, as I think it's impossible. How would any program convert a raster image to text without OCR? Could you give me an example? The only ones I see simply extract existing text data from PDFs, and I think I can do that with VBS.
Cory Posted June 11, 2010
That's good sideways thinking, but I'd rather handle this locally instead of uploading the PDF to an online service for processing. Also, this data is financial, so I would have to jump through hoops to validate the service as being secure for this use.
michaelkenward Posted June 13, 2010
You probably want something that can monitor a folder and automatically run a PDF batch on files as they arrive. If this is serious work that you are prepared to pay for, rather than looking for shareware or even a freebie, then OmniPage can probably do what you want. It may be more sophisticated than you need, but anything less capable may not handle folder monitoring. Nuance has deals on at the moment and offers 30-day money back. (Link: OmniPage OCR Software by Nuance.) I use OmniPage, but not for anything heavyweight like your application. They have a dual pricing scheme with a much reduced upgrade price. Unlike others, they do not seem to police this in any way that forces you to actually own an old version. (You should check this, of course.) If you have qualms, I bet you can find a legitimate early version at a knock-down price somewhere.
Cory Posted June 14, 2010
OmniPage is my likely fallback. Thanks for the info, but the client already owns a copy that I can move to the server. I don't need something that can monitor a folder, as I can use MEP for that. But if I do go with OmniPage, I might use its batch processor.
lemming Posted June 16, 2010
This seems like a rather roundabout way to solve the problem. Do you know who creates the PDFs? Can you influence the workflow? Why not just configure Acrobat (or whatever) to spit out standard PDF files? -Lemming
Cory Posted June 23, 2010
lemming wrote: "This seems like a rather roundabout way to solve the problem. Do you know who creates the PDFs? Can you influence the workflow?"
Yeah, it does seem a long way around, doesn’t it? I thought of that straight off, but it's not an option. These are being generated by a big payroll company that has a competing service, so they are purposefully putting the reports into a format that is difficult to automate.
perkins Posted July 4, 2010
Convert to .tif and use Tesseract?
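A minimal sketch of the Tesseract step suggested above, written in Python for illustration (the thread itself does not name a scripting language). It assumes the `tesseract` command-line program is installed and on the PATH, and that the TIFF is in a format the installed version can read; the function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, output_base, exe="tesseract"):
    """Build the argument list for one Tesseract run.

    Tesseract takes an input image and an output *base* name;
    it appends ".txt" to that base name itself.
    """
    return [exe, str(tiff_path), str(output_base)]


def ocr_tiff(tiff_path, exe="tesseract"):
    """Run Tesseract on a TIFF and return the path of the text file it writes."""
    tiff_path = Path(tiff_path)
    # Strip only the final extension, e.g. "report.tif" -> "report".
    output_base = tiff_path.with_suffix("")
    subprocess.run(tesseract_command(tiff_path, output_base, exe), check=True)
    return Path(str(output_base) + ".txt")
```

Calling `ocr_tiff("report.tif")` would run `tesseract report.tif report` and, if the run succeeds, return the path `report.txt`. Building the command in a separate function keeps the argument list easy to inspect before anything is executed on the server.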
lemming Posted July 5, 2010
Cory wrote: "Yeah it does seem a long way around doesn’t it? I thought of that straight off but it's not an option. These are being generated by a big payroll company who has a competing service so they are purposefully putting it into a format to make it difficult to automate."
Ugh, intentional obfuscation. Are the documents in image format, or do they merely have DRM restrictions that limit copying/editing/printing? If it's the latter, you may be able to get past the DRM by using a non-Adobe PDF reader: http://www.whoismadhur.com/2009/04/03/5-free-alternatives-to-adobe-reader/ For more advanced DRM removal and password recovery, there's Advanced PDF Password Recovery from Elcomsoft.
Cory Posted July 6, 2010
As I said, they are raster data. But thanks for the suggestion.
Cory Posted July 6, 2010
That sounds almost exactly like what I was looking for. [sound of other shoe dropping] Do you have any suggestions for a good PDF to TIFF command line converter?
paul Posted July 6, 2010
Cory wrote: "That sounds almost exactly like what I was looking for. [sound of other shoe dropping] Do you have any suggestions for a good PDF to TIFF command line converter?"
Universal Document Converter
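For comparison, here is a hedged sketch of the PDF to TIFF step using Ghostscript, which is a free command-line tool but is not the converter recommended in this thread. It assumes Ghostscript is installed and that its Windows console executable is reachable as `gswin32c` (on other systems the name is typically `gs`). The function names, file names, and the choice of the `tiffgray` output device are illustrative assumptions.

```python
import subprocess


def ghostscript_command(pdf_path, tiff_path, dpi=300, exe="gswin32c"):
    """Build a Ghostscript argument list that renders a PDF to TIFF."""
    return [
        exe,
        "-dNOPAUSE",                  # do not pause between pages
        "-dBATCH",                    # exit when the file is done
        "-dSAFER",                    # restrict file access during rendering
        "-sDEVICE=tiffgray",          # 8-bit grayscale TIFF output
        f"-r{dpi}",                   # match the 300 DPI of the source images
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def pdf_to_tiff(pdf_path, tiff_path, dpi=300, exe="gswin32c"):
    """Render pdf_path to tiff_path and return the output path."""
    subprocess.run(ghostscript_command(pdf_path, tiff_path, dpi, exe), check=True)
    return tiff_path
```

With a single output name, a multi-page PDF becomes a multi-page TIFF. If the OCR tool in use expects one page per file, Ghostscript also accepts a page-number pattern such as `page-%03d.tif` in the output file name.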
Cory Posted July 6, 2010
Thanks buddy!