Wilm Posted March 31, 2010
Hey all, I have a PDF with one email address in its text. I want to store that address in a variable, but in silent mode. First I use the 'Program Launch' command with some parameters to convert the PDF to a text file, and then store the contents of that file in variable %T[1]%. Now I want to get the email address into a variable of its own. I thought of some ways to do that, for example using Split String on the text and then something like:
  IF %txt[1]% contains @
    set VAR %mail% to %txt[1]%
  END IF
But I don't think this is the easiest way to do it. Who can help me?
Greetings, Wilm
Yehnfikm8Gq Posted March 31, 2010
Your outline would be the easiest, if it were really that easy! Here is an outline of one method; name the variables as you wish:
  Set integer N1 to the position of @ in T1
  Set N2 = N1
  Set N3 = N1
  Find start of address:
  Repeat 50 times
    Copy part of text, 1 character at position N2 of T1, into T2
    If T2 is blank (a space)
      N2 increment
      Repeat Exit
    Else
      N2 decrement
    End If
  End Repeat
  Find end of address + 1 character:
  Repeat 50 times
    Copy part of text, 1 character at position N3 of T1, into T3
    If T3 is blank (a space)
      Repeat Exit
    Else
      N3 increment
    End If
  End Repeat
  N4 = N3 - N2 (number of characters in the address)
  Copy N4 characters of T1 starting at N2 into T4, the email address
  Search T4 for substrings .com, .net, .org etc.
  If none found
    Text Box Display "T4 (display value) is probably not a valid email address"
  Else
    Text Box Display "T4 (display value)"
  End If
If the address finishes at the end of a line with a CRLF, you will have to modify the search for the end of the address. You could, for example, put the position of the next "." after N1 into N3 and add 4 (for .com etc.). You could use that anyway instead of looking for a following space. Note that I have used (end character + 1) to save two lines of maths.
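For readers who want to try the idea outside Macro Express, the outline above can be sketched roughly in Python. This is only an illustration, not the macro itself: the function name is made up, the 50-character repeat limit is replaced by the bounds of the text, and any whitespace (space, tab, CR, LF) is treated as a boundary rather than the space character alone.

```python
from typing import Optional


def extract_email_by_whitespace(text: str) -> Optional[str]:
    """Return the whitespace-delimited run of characters around the first '@'.

    Hypothetical helper: mirrors the N1/N2/N3 scan in the outline, but stops
    at any whitespace character and at the ends of the text.
    """
    at = text.find("@")          # N1: position of '@'
    if at == -1:
        return None              # no '@' at all

    start = at                   # N2: walk left until the previous char is whitespace
    while start > 0 and not text[start - 1].isspace():
        start -= 1

    end = at                     # N3: walk right until whitespace (end + 1 position)
    while end < len(text) and not text[end].isspace():
        end += 1

    return text[start:end]       # N4 = end - start characters


print(extract_email_by_whitespace("Contact: wilm@example.com for info"))
```

Because the scan stops on CR and LF as well, an address sitting alone on its own line is picked up without the extra end-of-line handling the outline mentions.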
Cory Posted March 31, 2010
I can think of two ways.
First, find the position of the @, then count down, evaluating each character until you find a space. Increment to get the start position. Then count up the same way until you find a space, and decrement to get the end position. Now use Variable Modify String, Copy Part of Text, from start to end.
Second, if there isn't too much text, you could split the string into an array using spaces. Then start at element one and check whether the string contains an @. Repeat until an @ is found. That element should be the complete email address.
FYI, there are some command line utilities to extract text from a PDF that can run completely invisibly. A quick Google search should find something that works for you.
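The split-into-an-array approach is short enough to show in a few lines. The Python below is a sketch under assumptions, not Macro Express code: the function name is invented, and `str.split()` with no argument splits on any run of whitespace, not only on spaces as described above.

```python
from typing import Optional


def first_token_with_at(text: str) -> Optional[str]:
    """Split the text on whitespace and return the first piece containing '@'."""
    for token in text.split():   # each token plays the role of one array element
        if "@" in token:
            return token
    return None                  # no element contained '@'


print(first_token_with_at("Send the report to wilm@example.com today"))
```

Any punctuation glued to the address (a trailing comma, a leading "To:") stays in the returned token, so a trim or further check is still needed afterwards.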
Wilm Posted March 31, 2010
Thanks for your input! This was the sort of code I was looking for.
Wilm
Yehnfikm8Gq Posted March 31, 2010
Cory's method is quicker for ME Pro. You should Trim the final string because there could be non-printing characters like Tabs or CRLF at the start or finish of the string. Only you can tell if there are any other characters such as "Mail To:" immediately before the address.
Yehnfikm8Gq Posted April 1, 2010
I had further thoughts on this problem. I'm not suggesting Wilm bothers with this degree of detail.
Although both Cory and I used spaces to find the ends of the address, it's not foolproof. There is no foolproof method. The address could be preceded by "Mail to:" as I mentioned before, or this:
  The address is:
  myname@myaddress.com
  Look Ma, no spaces either side
The method that is most likely to work in all circumstances is the one I detailed, or Cory's modified, but not looking for spaces. Basically you would search forward and back from the @ for the first illegal character, which in most cases will be a text control character such as Space(32), CR(13), LF(10), or Tab(9). This is best done using the ASCII character value.
These are the basic permitted characters in the English language:
  33       !
  35-39    # $ % & '
  42       *
  43       +
  45       -
  47       /
  48-57    Digits 0 to 9
  63       ?
  64       @
  65-90    Uppercase English letters A-Z
  94-96    ^ _ `
  97-122   Lowercase English letters a-z
  123-126  { | } ~
Not covered:
  Hotmail and similar provider-specific restrictions
  " " (quotes) around otherwise non-permitted characters
  an IP address used for the server domain
  . (allowed if it is not the first or last character and does not appear twice consecutively)
  only one @ allowed outside quotes
If you found a " in the local part of the name, you would have to search for another " and everything in between would be legal.
The maximum length of the local name is 64 and of the domain name 255, but the total address cannot be more than 254. No doubt these things are subject to change.
An example of how difficult the task is even if you test fully: from "Mail to~myname@myaddress.com" you would get the wrong string "to~myname@myaddress.com", since ~ is legal in an address. It would, however, retrieve the address from "Mail to:myname@myaddress.com" correctly.
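The "search outward from the @ for the first illegal character" idea can be sketched in Python as below. Treat it as an illustration under assumptions: the function and constant names are invented, the allowed set is the list in the post plus "." and "=" (both added here as assumptions; RFC 5322 also permits "=" in the local part), and quoted local parts, IP-literal domains and the length limits are not handled.

```python
import string
from typing import Optional

# Characters from the post's list, minus '@' itself, plus '.' and '=' (assumed).
ALLOWED = set(string.ascii_letters + string.digits + "!#$%&'*+-/?^_`{|}~" + ".=")


def extract_email_by_charset(text: str) -> Optional[str]:
    """Grow outward from the first '@' while characters are in ALLOWED."""
    at = text.find("@")
    if at == -1:
        return None

    start = at
    while start > 0 and text[start - 1] in ALLOWED:
        start -= 1               # extend left over legal characters

    end = at + 1
    while end < len(text) and text[end] in ALLOWED:
        end += 1                 # extend right; a second '@' stops the scan

    # '.' may not be the first or last character, so drop stray dots
    # such as a sentence-ending full stop.
    return text[start:end].strip(".")


print(extract_email_by_charset("Mail to:myname@myaddress.com"))
print(extract_email_by_charset("Mail to~myname@myaddress.com"))
```

The second call reproduces the failure case described in the post: because ~ is a legal character, the scan runs back to the space and returns "to~myname@myaddress.com".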
paul Posted April 1, 2010
@ is also a perfectly legitimate character in other contexts, so you need to be certain that this character is used only for email addresses.
Yehnfikm8Gq Posted April 1, 2010
Unfortunately, if it's someone else's document you don't get to choose how many @ there are. It's a common abbreviation in pricing: 16 units @ $5 each. If you were trying to extract multiple email addresses mixed with scattered @ characters, you would have to do something like I mentioned in my first post: check that the strings thought to be email addresses contain or end with .org, .net, .com etc. Unfortunately the possibilities there are going up constantly. You could also rule out short strings containing @, assuming the shortest likely address would be perhaps ab@bc.com (9 characters).
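The two filters described above (a known ending and a minimum length) might look like this in Python. It is a sketch only: the names are hypothetical, the suffix list is deliberately tiny and would need extending, and the punctuation stripped from each candidate is an assumption about what typically surrounds an address in running text.

```python
from typing import List

KNOWN_SUFFIXES = (".com", ".net", ".org")   # illustrative only; the real list keeps growing
MIN_LENGTH = 9                              # e.g. ab@bc.com


def likely_emails(text: str) -> List[str]:
    """Return whitespace-separated tokens that pass the '@', length and suffix checks."""
    found = []
    for token in text.split():
        candidate = token.strip(".,;:()<>\"'")   # drop surrounding punctuation
        if "@" not in candidate:
            continue
        if len(candidate) < MIN_LENGTH:          # rules out a bare '@' in pricing text
            continue
        if not candidate.lower().endswith(KNOWN_SUFFIXES):
            continue
        found.append(candidate)
    return found


print(likely_emails("16 units @ $5 each, contact ab@bc.com or sales@example.org."))
```

A stand-alone @ fails the length check, so pricing text is skipped without any special handling.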
Cory Posted April 1, 2010
A normal @ will usually stand alone, so one could add logic to my solution so that if the beginning and end points are the same, that @ is deleted and the search is repeated. Also, I have a macro subroutine which validates email addresses. It is based on all the rules, like allowed characters, TLDs, valid number of characters and so on. It's still possible that the result is not a real email address, but the subroutine will ensure it's formatted properly.