Macro Express Forums
Jeff

OCR (Optical Character Recognition)


In my last job, we encountered a lot of situations where we wanted to gather data from archival documents like PDFs where the text was very regular, consistent, and always in the same font and size. I had developed the means to read targeted text by comparing pixel colors to the background color. I was able to "read" data off of these documents very quickly and it streamlined a lot of processes. It worked best with standard fonts, for example non-italic, non-bold. I created a character map database that lived in a submacro and compared pixel groupings in the document to the database. The data could be transferred to any form, file, or website as required. The process started out in VB, but was gradually transferred into ME.
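For readers who want to picture the approach, here is a minimal Python sketch of the idea described above: treat any pixel that differs from the background color as "ink", turn each character cell into a small on/off pattern, and look that pattern up in a character map. It is only an illustration and not the original VB or ME code. The 3x5 glyph shapes, the fixed cell width, and the names `GLYPHS`, `binarize`, `read_line`, and `render` are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Bitmap = Tuple[str, ...]       # rows of '#' (ink) and '.' (background)

WHITE: Pixel = (255, 255, 255)
BLACK: Pixel = (0, 0, 0)

# Hypothetical character map: 3x5 on/off patterns for three digits.
# A real map would hold one entry per character of the document's font.
GLYPHS: Dict[Bitmap, str] = {
    ("###", "#.#", "#.#", "#.#", "###"): "0",
    (".#.", "##.", ".#.", ".#.", "###"): "1",
    ("###", "..#", "###", "#..", "###"): "2",
}


def binarize(cell: List[List[Pixel]], background: Pixel) -> Bitmap:
    """Mark every pixel that differs from the background color as ink."""
    return tuple(
        "".join("." if px == background else "#" for px in row)
        for row in cell
    )


def read_line(pixels: List[List[Pixel]], background: Pixel, cell_width: int) -> str:
    """Split a row of text into fixed-width cells and look each one up.

    Assumes a fixed-pitch layout with no gap between cells; unknown
    patterns are reported as '?'.
    """
    chars = []
    for x in range(0, len(pixels[0]) - cell_width + 1, cell_width):
        cell = [row[x:x + cell_width] for row in pixels]
        chars.append(GLYPHS.get(binarize(cell, background), "?"))
    return "".join(chars)


def render(text: str) -> List[List[Pixel]]:
    """Test helper: draw text into a pixel grid using the same glyph map."""
    reverse = {ch: bitmap for bitmap, ch in GLYPHS.items()}
    rows: List[List[Pixel]] = []
    for r in range(5):
        row: List[Pixel] = []
        for ch in text:
            row.extend(BLACK if c == "#" else WHITE for c in reverse[ch][r])
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(read_line(render("102"), WHITE, 3))
```

An exact dictionary lookup like this only works when the rendering is perfectly consistent, which matches the "same font and size" condition mentioned in the post. In a macro tool the pixel grid would come from reading screen pixel colors rather than from a helper like `render`.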


What's your question?


BTW, Tesseract OCR worked well for me. So did OmniPage. I have also used free online OCR engines, but I would never use those for sensitive data.

19 hours ago, Jeff said:

In my last job, we encountered a lot of situations where we wanted to gather data from archival documents like PDFs where the text was very regular, consistent, and always in the same font and size. I had developed the means to read targeted text by comparing pixel colors to the background color. I was able to "read" data off of these documents very quickly and it streamlined a lot of processes. It worked best with standard fonts, for example non-italic, non-bold. I created a character map database that lived in a submacro and compared pixel groupings in the document to the database. The data could be transferred to any form, file, or website as required. The process started out in VB, but was gradually transferred into ME.

 

Now that's impressive!!!  You did OCR strictly with ME macros??? 

 

How much data did you have to get off a typical document -- a few key characters, or whole lines, or pages of text? 

How large were the fonts, or did you blow up the PDFs to make gigantic letters? 

Even though you were working with known fonts and sizes, there must have been slight variations between the pixels "read" and the standardized pixel maps.  How did you adjust for the differences?  How did you determine that a character was, or was not, a match to one of the maps?  By sampling a few dozen, or a few hundred, pixels within a known space on the screen?  
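One common way to handle the kind of slight variation asked about here is to count mismatched pixels between the sampled cell and each stored map, and accept the closest map only if the count stays under a threshold. The Python sketch below shows that idea. It is not a description of how Jeff's macros worked, and the 3x5 templates, the `max_mismatches` threshold, and the function names are assumptions chosen for illustration.

```python
from typing import Dict, Optional, Tuple

Bitmap = Tuple[str, ...]  # rows of '#' (ink) and '.' (background)

# Hypothetical 3x5 templates; a real map would cover the whole font.
TEMPLATES: Dict[str, Bitmap] = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "2": ("###", "..#", "###", "#..", "###"),
}


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixel positions where two same-sized bitmaps differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def best_match(
    cell: Bitmap,
    templates: Dict[str, Bitmap],
    max_mismatches: int = 2,
) -> Optional[str]:
    """Return the closest character, or None if nothing is close enough.

    On a tie, the template listed first in the dictionary wins.
    """
    best_char: Optional[str] = None
    best_dist = max_mismatches + 1
    for ch, template in templates.items():
        dist = hamming(cell, template)
        if dist < best_dist:
            best_char, best_dist = ch, dist
    return best_char


if __name__ == "__main__":
    noisy_zero = ("###", "#.#", "#.#", "#.#", "##.")  # one pixel flipped
    print(best_match(noisy_zero, TEMPLATES))
```

The threshold is the design choice that matters: set it too low and anti-aliased or slightly shifted characters are rejected, set it too high and similar shapes start to be confused with each other.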

When you say you could read "very quickly", what does that mean in characters per second, or however you measured it? 

 

Sounds like wicked good programming fun!!! 

