Macro Express Forums

How to get an email address out of a PDF?



Hey all,

 

I have a PDF with one email address in its text.

Now I want to store that address in a variable, but in silent mode.

 

First I use the 'Program Launch' command with some parameters to convert the PDF to a text file, and then store the contents of that file in variable %T[1]%.

 

Now I want to get that email address into a variable.

 

I thought of some ways to do that, for example splitting the text into an array and then doing something like

IF

%txt[1]% contains @

set VAR %mail% to %txt[1]%

END IF

 

But I think this is not the easiest way to do it.

 

 

Who can help me?

 

 

 

Greetings, Wilm


Your outline would be the easiest, if it were really that easy! Here is an outline of one method; name the variables as you wish:

 

Set integer N1 to the position of @ in T1

Set N2 = N1

Set N3 = N1

 

Find Start of Address

Repeat 50 times

Copy part of text, 1 character at position N2 of T1 into T2

If T2 = " " (a space)

N2 increment

Repeat Exit

Else N2 decrement

End If

End Repeat

 

Find End of Address + 1 character

Repeat 50 times

Copy part of text, 1 character at position N3 of T1 into T3

If T3 = " " (a space)

Repeat Exit

Else N3 increment

End If

End Repeat

 

N4 = N3-N2 (number of characters in address)

 

Copy N4 characters of T1 starting at N2 into T4, the email address

 

Search T4 for substrings .com, .net, .org, etc.

If none found, Text Box Display "T4 (display value) is probably not a valid email address"

Else

Text Box Display "T4 (display value)"

End If

 

If the address finishes at the end of a line with a CRLF, you will have to modify the search for the end of the address. You could, for example, find the position of the next "." after N1, store it in N3, and add 4 (for .com etc.); this assumes a three-letter ending and a single dot in the domain. You could use that anyway instead of looking for a following space. Note that I have used (end character + 1) to save 2 lines of maths.
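For anyone who finds it easier to read as ordinary code, the outline above can be sketched roughly like this in Python. This is only an illustration, not Macro Express syntax; the function name `extract_address` is made up for the example. It stops at any whitespace character rather than only at a space, which is an assumption on my part that also covers the CRLF case.

```python
def extract_address(text):
    """Return the whitespace-delimited token around the first '@', or None."""
    at = text.find("@")
    if at == -1:
        return None

    # Walk left until the character before `start` is whitespace
    # or the beginning of the text is reached.
    start = at
    while start > 0 and not text[start - 1].isspace():
        start -= 1

    # Walk right until `end` points at whitespace or the end of the text.
    end = at
    while end < len(text) and not text[end].isspace():
        end += 1

    # `end` is one past the last character, as in the outline (end + 1).
    return text[start:end]


print(extract_address("Contact:\r\nmyname@myaddress.com\r\nThanks"))
```

Unlike the 50-iteration Repeat loops, the sketch simply runs to the edge of the text, so there is no fixed limit on the address length.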


I can think of two ways.

 

First, find the position of the @, then count down, evaluating each character until you find a space, and increment to get the start position. Then count up the same way until you find a space, and decrement to get the end position. Now use Variable Modify String, Copy Part of Text, from start to end.

 

If there isn't too much text you could split the string into an array using spaces. Then start at element one and check if the string contains an @. Repeat until an @ is found. That should be the complete email address.
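A minimal sketch of the split approach, written in Python purely for illustration (the function name `first_token_with_at` is hypothetical). It assumes that splitting on any whitespace, including line breaks, is acceptable.

```python
def first_token_with_at(text):
    """Split on whitespace and return the first piece containing '@'."""
    for token in text.split():
        if "@" in token:
            return token
    return None  # no '@' anywhere in the text


print(first_token_with_at("Please reply to myname@myaddress.com today"))
```

This only works cleanly when the address is surrounded by whitespace and is the first thing in the text that contains an @.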

 

FYI there are some command line utilities to extract text from a PDF that can run completely invisibly. A quick Google search should find something that works for you.


I had further thoughts on this problem. I'm not suggesting Wilm bothers with this degree of detail. Although both Cory and I used spaces to find the ends of the address, it's not foolproof. There is no foolproof method. The address could be preceded by "Mail to:" as I mentioned before, or this:

 

The address is:

myname@myaddress.com

Look Ma, no spaces either side

 

The method that is most likely to work in all circumstances is the one I detailed, or Cory's modified version, but without looking for spaces. Basically, you would search forward and back from the @ for the first illegal character, which in most cases will be a text control character such as Space (32), CR (13), LF (10), or Tab (9). This is best done using the ASCII character value. These are the basic permitted characters for an address written in English:

 

33 !

35-39 #$%&'

42 *

43 +

45 -

47 /

48-57 Digits 0 to 9

61 =

63 ?

64 @

65-90 Uppercase English letters A-Z

94-96 ^_`

97-122 Lowercase English letters a-z

123-126 { | } ~

 

Not covered:

Hotmail and similar restrictions

" " for non-permitted characters

IP address used for server domain

. (permitted if it is not the first or last character and does not appear twice consecutively)

only one @ allowed outside quotes

 

If you found a " in the local part of the name, you would have to search for another " and everything in between would be legal. The maximum length of the local name is 64 and of the domain name 255, but the total address cannot be more than 254. No doubt these things are subject to change.

 

An example of how difficult the task is even if you fully test:

From "Mail to~myname@myaddress.com" you would get the wrong string "to~myname@myaddress.com", since ~ is legal in an address.

It would however retrieve "Mail to:myname@myaddress.com" correctly.
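The two examples above can be reproduced with a rough Python sketch of the permitted-character scan. It is an illustration under assumptions, not a full validator: the name `extract_by_allowed_chars` is made up, the same character set is used on both sides of the @ for simplicity, and the dot is treated as permitted and then trimmed from the ends. Quoted local parts and IP-address domains are not handled.

```python
import string

# Letters, digits, the punctuation from the list above, plus "." (assumption:
# dots are accepted while scanning and trimmed from the ends afterwards).
ALLOWED = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~.")


def extract_by_allowed_chars(text):
    """Scan outward from the first '@' until a non-permitted character."""
    at = text.find("@")
    if at == -1:
        return None

    start = at
    while start > 0 and text[start - 1] in ALLOWED:
        start -= 1

    end = at + 1
    while end < len(text) and text[end] in ALLOWED:
        end += 1

    # A dot may not be the first or last character, so trim stray ones,
    # for example a full stop at the end of a sentence.
    return text[start:end].strip(".")


print(extract_by_allowed_chars("Mail to:myname@myaddress.com"))
print(extract_by_allowed_chars("Mail to~myname@myaddress.com"))
```

The first call returns the address alone because ":" is not permitted; the second still picks up "to~" because every character in it is legal.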


Unfortunately, if it's someone else's document you don't get to choose how many @ there are. It's a common abbreviation in pricing, as in "16 units @ $5 each". If you were trying to extract multiple email addresses mixed with scattered @ signs, you would have to do something like I mentioned in my first post: check that the strings thought to be email addresses contain or end with .org, .net, .com, etc. Unfortunately, the number of possible endings is going up constantly. You could also rule out short strings containing @, assuming the shortest likely address would be perhaps ab@bc.com, which is 9 characters.
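As a sketch of that filtering idea in Python (illustrative only): the function name `candidate_addresses`, the default suffix list, and the punctuation being trimmed are all assumptions, and the suffix list would need extending for real documents.

```python
def candidate_addresses(text, min_length=9, suffixes=(".com", ".net", ".org")):
    """Return whitespace-separated pieces that look like email addresses."""
    found = []
    for token in text.split():
        # Trim punctuation that commonly surrounds an address in running text.
        token = token.strip(".,;:()<>\"'")
        # Skip pieces without '@' and short ones such as a lone pricing '@'.
        if "@" not in token or len(token) < min_length:
            continue
        # Keep only pieces that end with one of the expected suffixes.
        if token.lower().endswith(suffixes):
            found.append(token)
    return found


sample = "16 units @ $5 each, ask sales@example.com or ab@bc.com."
print(candidate_addresses(sample))
```

The lone @ from the pricing text is dropped by the length check, and both addresses are kept.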


A normal @ will usually stand alone, so one could add logic to my solution: if the beginning and end points are the same, discard that @ and repeat with the next one.

 

Also, I have a macro subroutine which validates email addresses. It is based on all the rules, such as allowed characters, TLDs, valid number of characters, and so on. It cannot guarantee that an address actually exists, but it does ensure that it is formatted properly.
