Challenge: "Scrape" a document for email addresses

acantor · August 9, 2023

I often need to extract email addresses that appear in documents, spreadsheets, email messages, and webpages. I used to do this manually, but recently, I realized I should be using a macro to do the heavy lifting... at least most of the heavy lifting.

So here's the challenge:

Write a Macro Express script to analyze whatever is in the clipboard, and display only the email addresses it contains.

For example, if you copied this to the clipboard:

Quote

"Blb bla bla xx@yy.com bla bla

bla aa@bb.info bla bla bla"

The macro returns this:

Quote

xx@yy.com

aa@bb.info

Although I've set MEP challenges in the past, I don't think I've ever made this a requirement: Make your macro as short as possible, using as few lines of code as you can. (Comments and blank lines don't count.)

I'm still working on this challenge. I have an MEP macro that sort-of works, but it's not a solution I can live with... yet!

My suggested rules for this challenge:

1. Your script shouldn't use RegEx. (However, I'd be curious to see how this challenge can be done via RegEx.)

2. Don't worry if your script doesn't handle free-floating at-signs that don't form part of an email address, e.g., "See you @ noon" ... unless you want to!

3. Ensure your script is capable of handling between zero and, say, 1000 email addresses.

Cory · August 9, 2023

"Your script shouldn't use RegEx" LMAO. Well OK. So let's build a house with a hammer and handsaw when we have free access to an entire trailer of construction tools.

I gave up writing scrapers in MEP because of its limitations. There's no point in reinventing the wheel. I wrote many macros to do things like this and then I would learn about some rule I wasn't aware of, some odd case. Or even how one establishes word boundaries. Could be a space, could be a comma or other punctuation, beginning or end of line, a tab... That alone is like 100 lines of code and don't even get me started on valid characters, subdomains... It's huge. In RegEx a word boundary is "\b". Done. In the 70's some smart guys got together and realized they were often needing to process text like this and invented RegEx. Innumerable man hours have been added since then improving it. And it's free. I'd rather use an external script with one line of code for the RegEx so I'll decline the challenge 🙂

Aside: I created a program that MEP could use to do Regex without external script. It could be visible or invisible and manipulated by Windows Control commands. Worked great. There was zero interest here in the forum.

Good luck reinventing RegEx. Just kidding (only a little)

acantor · August 9, 2023

Quote

Or even how one establishes word boundaries. Could be a space, could be a comma or other punctuation, beginning or end of line, a tab... That alone is like 100 lines of code and don't even get me started on valid characters, subdomains... It's huge.

You're right. Macro Express is not ideal for this task. But I'm still curious how others will go about solving the problem. For me, it's interesting to find out how far one can go despite Macro Express's constraints.

You may be slightly overestimating how much code is involved in solving this puzzle with MEP.

My first attempt was about 65 lines long -- including figuring out the word boundaries.

Then I tried another way. The script shrank to about 40 lines, but was too spaghetti-like for my liking.

My most recent attempt is smaller. A lot smaller. The script probably isn't efficient, but the code to sort out word boundaries is fairly straightforward, even without the undeniable benefits of RegEx.

Quote

Aside: I created a program that MEP could use to do Regex without external script. It could be visible or invisible and manipulated by Windows Control commands. Worked great. There was zero interest here in the forum.

I'm interested!

Quote

Good luck reinventing RegEx. Just kidding (only a little)

🤣

rberq · August 9, 2023

This is working pretty well with Notepad text.

It relies on the fact that an email address will
1) Be a contiguous string of characters
2) Contain no embedded blanks
3) Contain one embedded @ sign
4) Be followed by a blank, or end of line.
There are very likely more tweaks needed that I haven't discovered.

About 20 lines, if you don't count setting up miscellaneous variable values.

I ignored your idea of minimizing number of instructions, until my original brute-force method got too elaborate. Then I changed approach to make a macro considerably smaller than the original. So good idea, up to a point. I knew a programmer, back in 1968, who wrote assembly-language code, then would write code that actually overlaid the generated machine language during execution, in order to avoid putting additional IF logic in multiple places. Saved a few bytes, which maybe was useful in 1968 (NOT!) -- but boy it was a bear to debug if he wasn't around. So I'm not a big believer in saving instructions at the expense of clarity.

//
Log Messages to "C:\Temp\MacroExpressProLogFiles\MacroExpressPro_Macro_Log_File.txt"
"Macro executed: (0_A_eMail_Scraper)"
Log Errors to "C:\Temp\MacroExpressProLogFiles\MacroExpressPro_Macro_Log_File.txt"
//
// Extract email addresses from text
//
Program Launch: "challenge.txt" (Normal)
Parameters: // Get test file of text
//
// Set miscellaneous constants
// Tab character ascii 9
Variable Set to ASCII Char 9 to %TAB%
// Line Feed (New Line) character ascii 10
Variable Set to ASCII Char 10 to %LINEFEED%
// Carriage Return character ascii 13
Variable Set to ASCII Char 13 to %CARRIAGERETURN%
// Carriage Return / Line Feed combination characters ascii 13 + ascii 10
Variable Set to ASCII Char 13 to %CRLF%
Variable Modify String %CRLF%: Append Text String Variable (%LINEFEED%)
//
// Save all text in a variable
Text Type (Simulate Keystrokes): <CTRLD>a<CTRLU> // Highlight all text and copy to clipboard
Delay: 250 milliseconds
Text Type (Simulate Keystrokes): <CTRLD>c<CTRLU>
Delay: 250 milliseconds
Text Type (Simulate Keystrokes): <END>
Variable Set String %text% from the clipboard contents // Save initial text in variable

//
// Remove periods and @ not embedded in email addresses, also others
Variable Modify String %text%: Trim // Trim left and right ends of text
Variable Modify String: Replace "@@" in %text% with " " // Replace any double-@, with single space
Variable Modify String: Replace "@ " in %text% with " " // Replace all @ followed by space, with single space
Variable Modify String: Replace " @" in %text% with " " // Replace all @ preceded by space, with single space
Variable Modify String: Replace "%CARRIAGERETURN%" in %text% with " " // Replace all carriage returns, with single space
Variable Modify String: Replace "%LINEFEED%" in %text% with " " // Replace all linefeeds, with single space
Variable Modify String: Replace "%TAB%" in %text% with " " // Replace all tab characters with single space
Variable Modify String: Replace ". " in %text% with " " // Replace all periods followed by space, with single space
Variable Modify String: Replace " ." in %text% with " " // Replace all periods preceded by space, with single space
//
Variable Set String %emails% to "" // set email list null

Variable Modify String %text%: Append Text ( @@) // Append space and double @@ to text being processed -- serves as end of text delimiter
//
// Extract email addresses, stack in variable "emails"
Repeat Until %text% Equals "@@"
Variable Set Integer %index% to the position of " " in %text% // Find space delimiting first word
Variable Modify String: Copy part of text in %text% starting at 1 and %index% characters long to %subtext% // Copy text up to and including first space
If Variable %subtext% Contains "@" // If extracted "word" contains @, assume it is an email address
Variable Modify String %emails%: Append Text String Variable (%subtext%) // Append the email address to the list we are building
Variable Modify String %emails%: Append Text String Variable (%CRLF%) // Append carriage return / line feed to the email address
End If
Variable Modify String: Delete part of text from %text% starting at 1 and %index% characters long // Delete text up to and including first space
End Repeat
//
Text Box Display: Extracted Email Addresses
//
Macro Return
//

*******************************************************************************************
*******************************************************************************************
*******************************************************************************************

<COMMENT Value=" "/>
<LOG MESSAGES Filename="C:\\Temp\\MacroExpressProLogFiles\\MacroExpressPro_Macro_Log_File.txt" Message="Macro executed: (0_A_eMail_Scraper)" Stamp="TRUE"/>
<LOG ERRORS Filename="C:\\Temp\\MacroExpressProLogFiles\\MacroExpressPro_Macro_Log_File.txt" Hide_Errors="FALSE"/>
<COMMENT Value=" "/>
<COMMENT Value="Extract email addresses from text"/>
<COMMENT Value=" "/>
<PROGRAM LAUNCH Path="c:\\temp\\challenge.txt" Mode="\x00" Default_Path="TRUE" Wait="1" Get_Console="FALSE" _COMMENT="Get test file of text"/>
<COMMENT Value=" "/>
<COMMENT Value="Set miscellaneous constants "/>
<COMMENT Value="Tab character ascii 9"/>
<VARIABLE SET TO ASCII CHAR Value="9" Destination="%TAB%"/>
<COMMENT Value="Line Feed (New Line) character ascii 10"/>
<VARIABLE SET TO ASCII CHAR Value="10" Destination="%LINEFEED%"/>
<COMMENT Value="Carriage Return character ascii 13"/>
<VARIABLE SET TO ASCII CHAR Value="13" Destination="%CARRIAGERETURN%"/>
<COMMENT Value="Carriage Return / Line Feed combination characters ascii 13 + ascii 10"/>
<VARIABLE SET TO ASCII CHAR Value="13" Destination="%CRLF%"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%CRLF%" Variable="%LINEFEED%" NoEmbeddedVars="FALSE"/>
<COMMENT Value=" "/>
<COMMENT Value="Save all text in a variable"/>
<TEXT TYPE Action="0" Text="<CTRLD>a<CTRLU>" _COMMENT="Highlight all text and copy to clipboard"/>
<DELAY Flags="\x12" Time="250"/>
<TEXT TYPE Action="0" Text="<CTRLD>c<CTRLU>"/>
<DELAY Flags="\x12" Time="250"/>
<TEXT TYPE Action="0" Text="<END>"/>
<VARIABLE SET STRING Option="\x02" Destination="%text%" NoEmbeddedVars="FALSE" _COMMENT="Save initial text in variable\r\n"/>
<COMMENT Value=" "/>
<COMMENT Value="Remove periods and @ not embedded in email addresses, also others"/>
<VARIABLE MODIFY STRING Option="\x00" Destination="%text%" _COMMENT="Trim left and right ends of text"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace="@@" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace any double-@, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace="@ " ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all @ followed by space, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace=" @" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all @ preceded by space, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace="%CARRIAGERETURN%" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all carriage returns, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace="%LINEFEED%" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all linefeeds, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace="%TAB%" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all tab characters with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace=". " ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all periods followed by space, with single space"/>
<VARIABLE MODIFY STRING Option="\x0F" Destination="%text%" ToReplace=" ." ReplaceWith=" " All="TRUE" IgnoreCase="TRUE" NoEmbeddedVars="FALSE" _COMMENT="Replace all periods preceded by space, with single space"/>
<COMMENT Value=" "/>
<VARIABLE SET STRING Option="\x00" Destination="%emails%" NoEmbeddedVars="FALSE" _COMMENT="set email list null\r\n"/>
<VARIABLE MODIFY STRING Option="\x06" Destination="%text%" Value=" @@" NoEmbeddedVars="FALSE" _COMMENT="Append space and double @@ to text being processed -- serves as end of text delimiter"/>
<COMMENT Value=" "/>
<COMMENT Value="Extract email addresses, stack in variable \"emails\""/>
<REPEAT UNTIL Variable="%text%" Condition="\x00" Value="@@"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%index%" Text_Variable="%text%" Text=" " Ignore_Case="FALSE" _COMMENT="Find space delimiting first word"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%subtext%" Variable="%text%" Start="1" Count="%index%" NoEmbeddedVars="FALSE" _COMMENT="Copy text up to and including first space"/>
<IF VARIABLE Variable="%subtext%" Condition="\x06" Value="@" IgnoreCase="FALSE" _COMMENT="If extracted \"word\" contains @, assume it is an email address"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%emails%" Variable="%subtext%" NoEmbeddedVars="FALSE" _COMMENT="Append the email address to the list we are building"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%emails%" Variable="%CRLF%" NoEmbeddedVars="FALSE" _COMMENT="Append carriage return / line feed to the email address"/>
<END IF/>
<VARIABLE MODIFY STRING Option="\x0A" Destination="%text%" Start="1" Count="%index%" _COMMENT="Delete text up to and including first space"/>
<END REPEAT/>
<COMMENT Value=" "/>
<TEXT BOX DISPLAY Title="Extracted Email Addresses" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033{\\fonttbl{\\f0\\fnil Tahoma;}}\r\n\\viewkind4\\uc1\\pard\\f0\\fs20 %emails%\r\n\\par }\r\n" Left="Center" Top="Center" Width="541" Height="637" Monitor="0" OnTop="TRUE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/>
<COMMENT Value=" "/>
<MACRO RETURN/>
<COMMENT Value=" "/>
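
For readers who don't use Macro Express, the idea behind the macro above (split the text at whitespace, keep any "word" that has an @ embedded in it) can be sketched in Python. This is a loose adaptation, not a line-by-line port: the function name `scrape_emails` is made up for illustration, and stripping periods and at-signs from the ends of each word is an assumption standing in for the macro's Replace steps.

```python
def scrape_emails(text):
    """Return whitespace-delimited words that have an embedded '@'."""
    emails = []
    # split() with no argument breaks on spaces, tabs, CR and LF alike
    for word in text.split():
        # Drop periods and stray at-signs at either end of the word
        core = word.strip(".@")
        # What is left counts as an address only if an '@' sits inside it
        if "@" in core:
            emails.append(core)
    return emails


sample = "Blb bla bla xx@yy.com bla bla\r\nbla aa@bb.info bla bla bla"
print("\n".join(scrape_emails(sample)))
```

Like the macro, this accepts anything with an embedded @, so it will happily return strings that are not valid addresses.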

acantor · August 10, 2023

I'll post my solution -- which I'm still tweaking -- later today or tomorrow. Our approaches to solving this puzzle are quite different!

I was intrigued by something you did, which resembles something I tried:

Variable Modify String: Replace "@@" in %text% with " " // Replace any double-@, with single space

This works nicely for a string like "Hello@@Goodbye". It gets transformed into "Hello Goodbye." Perfect!

But what if the string is "Hello@@@Goodbye"? The result is "Hello @Goodbye" -- even if one chooses the "Replace All Instances" option.

My first inclination was to repeat the instruction:

Variable Modify String: Replace "@@" in %text% with " " // Replace any double-@, with single space
Variable Modify String: Replace "@@" in %text% with " " // Replace any double-@, with single space

That handles "@@@". But what if a document contains long sequences of repeated @-signs, e.g., "Hello@@@@@@@@@@@@@@@@@@@@@@@@@@@@@Goodbye"? How many duplicate instructions should one include? Three? 100? Googolplex?

Of course, one could wrap the Replace instruction in a loop to delete repeated characters until they're all gone. But there's something unsatisfying about that. Is there a more elegant way? Or a stronger brute-force method?

The RegEx solution is probably trivially simple. Nevertheless, reinventing the wheel can be a valuable learning experience!
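
One loop-free alternative, sketched here in Python rather than MEP, is to walk the text once and treat each run of repeated at-signs as a single unit, whatever its length. The function name `collapse_at_runs` is hypothetical, and whether the same idea is practical in Macro Express is an open question.

```python
from itertools import groupby


def collapse_at_runs(text, replacement=" "):
    """Replace every run of two or more '@' with a single replacement."""
    pieces = []
    # groupby yields each stretch of identical consecutive characters
    for ch, run in groupby(text):
        run = "".join(run)
        if ch == "@" and len(run) > 1:
            pieces.append(replacement)   # a whole run becomes one space
        else:
            pieces.append(run)           # everything else is kept as-is
    return "".join(pieces)


# A single pairwise replace leaves a stray '@' behind for an odd run
print("Hello@@@Goodbye".replace("@@", " "))
# Treating the run as a unit does not
print(collapse_at_runs("Hello@@@Goodbye"))
```

A lone @, as in xx@yy.com, is left untouched, so real addresses survive the cleanup.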

rberq · August 10, 2023

25 minutes ago, acantor said:

Variable Modify String: Replace "@@" in %text% with " " // Replace any double-@, with single space

I wasn't intending to clean up the data. I did that ONLY because I have this command later in the code:

Variable Modify String %text%: Append Text ( @@) // Append space and double @@ to text being processed -- serves as end of text delimiter

As an "end-of-text" marker, it is used to finally break out of the REPEAT loop. On the chance that @@ was embedded elsewhere in the data, it would have represented a false "end-of-text". So you are correct, a longer string of @@@@@@ would still leave a false end-of-text. I should have used some much-longer random string like "!#$#%$^&*)(*%^+_~#" to serve as a marker, which would be unlikely to appear in valid data.

This end-of-data trick goes back to the olden days of matching up account records while processing multiple sequentially-ordered files, usually from magnetic tape. With only two tapes, it is not bad. But when you get three, four, or more tapes, the logic of which one to read next becomes horrendous once one or more has reached its end, unless you plug a dummy account number (high -- all nines) at that point.
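
The multi-file version of this trick might look something like the following Python sketch. Everything in it is illustrative: the lists stand in for tapes, `HIGH` is a made-up all-nines dummy account number, and the code assumes real account numbers are always smaller than `HIGH`.

```python
HIGH = 999_999_999  # hypothetical "all nines" dummy account number


def merge_sorted(*files):
    """Merge ascending account-number lists using a high sentinel."""
    # Plug the dummy record onto the end of every input, so no reader
    # ever has to ask "has this tape run out?"
    tapes = [list(f) + [HIGH] for f in files]
    pos = [0] * len(tapes)
    merged = []
    while True:
        # Read next from whichever tape currently shows the lowest number
        i = min(range(len(tapes)), key=lambda k: tapes[k][pos[k]])
        current = tapes[i][pos[i]]
        if current == HIGH:
            break  # the lowest record is the sentinel: all tapes are done
        merged.append(current)
        pos[i] += 1
    return merged


print(merge_sorted([1, 4, 9], [2, 3], [5]))
```

The single comparison against `HIGH` replaces all the per-tape end-of-file logic, which is the point of the trick.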

I am anxious to see your macro since you say your approach is much different.

Cory · August 10, 2023

16 hours ago, acantor said:

You may be slightly overestimating how much code is involved in solving this puzzle with MEP

I don't think so. I've done it before. I think you're underestimating. 😁 Or maybe I'm just an inefficient MEP code writer. I've also done SSN, Dollar amounts, and more. Even using RegEx after a few million pages I often find the RegEx I created was deficient in some obscure way and I would end up spending time studying the RFCs and learning a lot about the valid format. For your over-simplified example, I could probably do it in 60 lines, but looking at it for only a few moments I can think of a few examples now in your set of examples that would cause it to get much more complicated. It's always 3 times more complicated than I initially think. And even if it's only 60 lines, that's a lot more than one line of the RegEx.Matches method.

One thing I failed to concede is that using RegEx is super easy... if you know RegEx. And that's no mean feat to learn. But I will say once one does take the time, the applications are numerous and one doesn't have to create the engine every time, one just needs to figure out the best pattern. BTW one wants to use a RegEx developer. I use RegEx Buddy but there are excellent online developers. Also most of the common expressions have been thought out already and are available to use.

I was thinking of an odd email address I found once that was _@domain.com so I searched and found a list a RegEx developer compiled of valid and invalid email addresses to help you test. You can iterate through these to gauge the effectiveness of your code. Probably some better test lists out there. Many "oh yeah.... Hadn't thought of that" ones I see in there like firstname+lastname@example.com or email@[123.123.123.123]. And invalids that would be interesting to avoid like email@example@example.com. Are you avoiding double periods? Length limits? And of course if your project can tolerate a small error rate, then no need to complicate your code for these examples. But this is like some of the stuff I ran into.

Also it depends on your data source. I often was tasked with scraping random and indeterminable sources. If you're working within a company say that has rules about email formatting then your task will be much simpler.

I do applaud you pushing the limits. I have written so many sub-macros like this over the years, like array bubble sorts and much more, in MEP. It's a good exercise. I just changed my philosophy and I'm not interested in spending the long hours doing that. I'd rather bill hours and get more practical things done with my life. And I have a lot to get done if I'm ever going to get out of this state.

You're good to avoid spaghetti. I would often have to come back and work on a macro years after I wrote it and it would take me a long time to figure out my methodology. So unless it was doing heavy processing, it was better to use more variables than absolutely necessary, more comments, more modularity. Sure it might be twice as long, but it's more important to be able to work on it easily in the future. If it's a non-iterative macro for a user who only uses it 100 times a day, then let it be understandable at the cost of a millisecond.

When I started in .NET one can nest functions inline. So instead of defining a single use variable and executing a method to save in that variable and then use it in a later function, one can just place that first function as a parameter of the second function or method avoiding the creation of the variable. I got really excited because I could write code in much fewer lines. Yup. I forgot my lesson. Coming back later it was much harder to understand. But the cool thing with .Net is you get to have your cake and eat it too. In MEP if I expand my code to be readable, it takes longer. In .NET the MSIL (it uses an intermediate language and 'just in time' compiling) compiler is so smart it rearranges my code into the super efficient version.

I'll make a new post for my app later. I might send it to you in PM to get your feedback for instructions. I'm burning daylight (Ranch/farm speak for "I need to get back to work").

Cory · August 10, 2023

I was wrong, RegEx isn't perfect either. I was going to get the silver bullet expression to share and.... There isn't one. In my RegEx Buddy is a library of expressions and they include a test subject. There are 12 and none work on all and avoid all invalid. I wanted to share how funny this is in defense of being accused of underestimating as RegEx is much more capable and even it doesn't have a perfect solution. What was perceived by acantor as an overestimation was actually a gross underestimation. This is a great example of my axiom of "However complicated you think a thing is, once you research it it's always much more complicated." On the plus side I have some more examples to test for false positives and negatives. I was thinking of some examples to play devil's advocate to challenge your macros, but it's already been done.

Just for fun... Here's the test subject:

Valid addresses:
================
president@whitehouse.gov
ip@1.2.3.123
pharaoh@egyptian.museum
john.doe+regexbuddy@gmail.com
Mike.O'Dell@ireland.com
"Mike\ O'Dell"@ireland.com
IPguy@[1.2.3.4]
The email address president@whitehouse.gov is valid.
fabio@disapproved.solutions has a long TLD
fabio@email.validating.solutions

Invalid addresses:
==================
1024x768@60Hz
not.a.valid.email
invalid@ifon.nonexistingtld
john@aol...com
Mike\ O'Dell@ireland.com
joe@a_domain_name_with_more_than_sixty-four_characters_is_invalid_6465.com
a_local_part_with_more_than_sixty-four_characters_is_invalid_6465@mail.com

This is the simple one. \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b It matches 7 of the valid samples but unfortunately it also matches 4 of the invalid. It comes with the note "Use this version to seek out email addresses in random documents and texts. Does not match email addresses using an IP address instead of a domain name. Requires the "case insensitive" option to be ON." Oh yeah, case sensitivity is fun too. In MEP I'd convert it all to lowercase first. But I think for your application this would be adequate.
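
As a rough illustration, here is that simple pattern applied with Python's `re` module, with `re.IGNORECASE` standing in for the "case insensitive" option. The function name `find_emails` is made up for this sketch, and Python's regex flavor may not agree with RegEx Buddy on every sample, so the match counts quoted above are not reproduced here.

```python
import re

# The "simple" pattern from the post, compiled case-insensitively
EMAIL_RE = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",
    re.IGNORECASE,
)


def find_emails(text):
    """Return every substring of text that the simple pattern matches."""
    return EMAIL_RE.findall(text)


sample = "Blb bla bla xx@yy.com bla bla\nbla aa@bb.info bla bla bla"
print(find_emails(sample))             # the two addresses from the challenge
print(find_emails("1024x768@60Hz"))    # rejected: no dot-plus-letters ending
print(find_emails("john@aol...com"))   # accepted, although it is invalid
print(find_emails("IPguy@[1.2.3.4]"))  # rejected: no IP-address support
```

The third call shows the kind of false positive mentioned above: the pattern has no rule against consecutive periods in the domain.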

Oh, and then there's invalid TLDs. That would be an entire routine to itself. And normally I'd think to limit to 3 characters, but they changed that rule and there are many 4 now. I looked it up and there can be as many as 63 characters now in the TLD. Yikes. But if one only wants things that look like a valid TLD, then it's not an issue.

Even this one tries to use RFC compliance rules and still misses 2.

[Image: RegExexample.jpg]

Yikes. Like she said when I asked if she had a husband, "It's complicated".

acantor · August 10, 2023

Hi Cory,

I hadn't realized how complicated the rules for valid email addresses can be. After reading your two posts, I did some reading on the topic. At this point, my macro doesn't handle these situations:

A valid email address must contain only one "@".
The part before the @ must be 64 characters or less.
Underscores are allowed before the @, but not in the domain (after the @).

It's not clear, from my limited reading, whether forward slashes, hash marks, plus signs, percent signs, and square brackets are truly valid.
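
The three rules listed above are simple enough to check after the fact. The Python sketch below does only that and nothing more; the function name `passes_basic_rules` is hypothetical, and the extra check that both sides of the @ are non-empty is an added assumption, not one of the listed rules.

```python
def passes_basic_rules(candidate):
    """Check a candidate address against the three rules listed above."""
    # Rule 1: exactly one at-sign
    if candidate.count("@") != 1:
        return False
    local, domain = candidate.split("@")
    # Assumption: something must appear on both sides of the '@'
    if not local or not domain:
        return False
    # Rule 2: the part before the '@' is 64 characters or less
    if len(local) > 64:
        return False
    # Rule 3: underscores are allowed before the '@' but not after it
    if "_" in domain:
        return False
    return True


for address in ("_@domain.com", "email@example@example.com", "joe@a_domain.com"):
    print(address, passes_basic_rules(address))
```

Run as a filter over the macro's output, a check like this would weed out some false positives without touching the scraping logic itself.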

It would take less than a minute to modify my script so it will accept the following as a valid email address:

postmaster@[123.123.123.123]

But it would be a slog to make the macro smart enough to reject this:

postmaster@]123.123.123.123[

Quote

And of course if your project can tolerate a small error rate, then no need to complicate your code for these examples. But this is like some of the stuff I ran into.

I should have explicitly said, when setting the challenge, that a small error rate is acceptable!

acantor · August 10, 2023

Here's my solution. It's 20 lines long, but far from bullet-proof. There's no error checking, and the macro seems to get stuck when I feed it a document that contains thousands of lines of text. Oh well. "A small error rate is acceptable!"

Assume the text is already in the clipboard. In outline, here's how the macro works:

1. Assign the clipboard to a string variable, %Clip%.

2. Check each character. If the character is valid for an email address, append the character to a string variable, %Result%. If the character is NOT valid, append "*" instead.

Example:

%Clip% = "abc@def, 123! uvw@xyz"

%Result% = "abc@def**123**uvw@xyz"

3. Split %Result% at "*" and assign each value to an array, %PossibleEmail[%Count%]%.

From the example above:

%PossibleEmail[1]% = "abc@def"
%PossibleEmail[2]% = ""
%PossibleEmail[3]% = "123"
%PossibleEmail[4]% = ""
%PossibleEmail[5]% = "uvw@xyz"

4. Check each %PossibleEmail[]% for "@". If it contains the symbol, assume we have an email address, and add it to a list.

5. Display the list:

abc@def
uvw@xyz

Variable Set String %ValidChars% to "abcdefghijklmnopqrstuvwxyz1234567890-_@." // Every valid character in an email address
 
Variable Set String %Clip% from the clipboard contents
Variable Set Integer %ClipLength% to the length of variable %Clip%
 
Repeat Start (Repeat %ClipLength% times) // Parse input, one character at a time
  Variable Modify String: Copy part of text in %Clip% starting at %Count% and 1 characters long to %Char%
  If Variable %ValidChars% Contains "%Char%" // This character MIGHT be part of an email address
    Variable Set String %Result% to "%Result%%Char%" // Append the character to %Result%
  Else // This character cannot be part of an email address
    Variable Set String %Result% to "%Result%*" // Append "*" to %Result%. It means any invalid character
    Variable Modify Integer %StarCount%: Increment // Keep track of the number of invalid characters
  End If
End Repeat
 
Variable Modify Integer: %StarCount% = %StarCount% + 1 // Calculate how many times to split %Result%
 
Split String "%Result%" on "*" into %PossibleEmail%, starting at 1
Repeat Start (Repeat %StarCount% times)
  If Variable %PossibleEmail[%Count%]% Contains "@" // Assume an "@" means a string is part of an email address
    Variable Set String %EmailList% to "%EmailList%
%PossibleEmail[%Count%]%" // Create a list of email addresses
  End If
End Repeat
 
Text Box Display: Scraped Email Addresses

<VARIABLE SET STRING Option="\x00" Destination="%ValidChars%" Value="abcdefghijklmnopqrstuvwxyz1234567890-_@." NoEmbeddedVars="FALSE" _COMMENT="Every valid character in an email address"/>
<COMMENT/>
<VARIABLE SET STRING Option="\x02" Destination="%Clip%" NoEmbeddedVars="FALSE"/>
<VARIABLE SET INTEGER Option="\x0D" Destination="%ClipLength%" Text_Variable="%Clip%"/>
<COMMENT/>
<REPEAT START Start="1" Step="1" Count="%ClipLength%" Save="TRUE" Variable="%Count%" _COMMENT="Parse input, one character at a time"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Char%" Variable="%Clip%" Start="%Count%" Count="1" NoEmbeddedVars="FALSE"/>
<IF VARIABLE Variable="%ValidChars%" Condition="\x06" Value="%Char%" IgnoreCase="TRUE" _COMMENT="This character MIGHT be part of an email address"/>
<VARIABLE SET STRING Option="\x00" Destination="%Result%" Value="%Result%%Char%" NoEmbeddedVars="FALSE" _COMMENT="Append the character to %Result%"/>
<ELSE _COMMENT="This character cannot be part of an email address"/>
<VARIABLE SET STRING Option="\x00" Destination="%Result%" Value="%Result%*" NoEmbeddedVars="FALSE" _COMMENT="Append \"*\" to %Result%. It means any invalid character"/>
<VARIABLE MODIFY INTEGER Option="\x07" Destination="%StarCount%" _COMMENT="Keep track of the number of invalid characters"/>
<END IF/>
<END REPEAT/>
<COMMENT/>
<VARIABLE MODIFY INTEGER Option="\x00" Destination="%StarCount%" Value1="%StarCount%" Value2="1" _COMMENT="Calculate how many times to split %Result%"/>
<COMMENT/>
<SPLIT STRING Source="%Result%" SplitChar="*" Dest="%PossibleEmail%" Index="1"/>
<REPEAT START Start="1" Step="1" Count="%StarCount%" Save="TRUE" Variable="%Count%"/>
<IF VARIABLE Variable="%PossibleEmail[%Count%]%" Condition="\x06" Value="@" IgnoreCase="FALSE" _COMMENT="Assume an \"@\" means a string is part of an email address"/>
<VARIABLE SET STRING Option="\x00" Destination="%EmailList%" Value="%EmailList%\r\n%PossibleEmail[%Count%]%" NoEmbeddedVars="FALSE" _COMMENT="Create a list of email addresses"/>
<END IF/>
<END REPEAT/>
<COMMENT/>
<TEXT BOX DISPLAY Title="Scraped Email Addresses" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033{\\fonttbl{\\f0\\fnil\\fcharset0 Tahoma;}{\\f1\\fnil Tahoma;}}\r\n\\viewkind4\\uc1\\pard\\lang4105\\f0\\fs20 %EmailList%\\lang1033\\f1\\fs14 \r\n\\par }\r\n" Left="645" Top="19" Width="664" Height="976" Monitor="0" OnTop="TRUE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/>
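
For comparison, the five-step outline above (mask every invalid character with "*", split on "*", keep the pieces that contain "@") can be sketched in a few lines of Python. The names `mask_invalid` and `scrape` are made up for illustration; the list of valid characters is the one from the macro, and lowercasing each character is an assumption that mirrors the macro's case-insensitive Contains test.

```python
VALID_CHARS = "abcdefghijklmnopqrstuvwxyz1234567890-_@."


def mask_invalid(clip):
    """Step 2: keep valid characters, replace every other one with '*'."""
    return "".join(ch if ch.lower() in VALID_CHARS else "*" for ch in clip)


def scrape(clip):
    """Steps 3 and 4: split on '*' and keep the pieces containing '@'."""
    result = mask_invalid(clip)
    return [piece for piece in result.split("*") if "@" in piece]


clip = "abc@def, 123! uvw@xyz"
print(mask_invalid(clip))
print("\n".join(scrape(clip)))
```

Because Python lists grow as needed, the question of how large to make the array does not come up in this version.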

rberq · August 10, 2023

58 minutes ago, acantor said:

I should have explicitly said, when setting the challenge, that a small error rate is acceptable!

Damn! Nobody ever said that to me when I was keeping track of which hospital patients we killed, or didn't.😉

rberq · August 10, 2023

Logically your macro is very similar to mine -- substitute for invalid characters, split into short strings based on the substitution character, check the individual strings for valid email format. I started out like you did, with an array, just because I wanted to play with the Split command which I had never used before. But then I didn't want to worry about how big to make the array, so I went directly from separating the strings, to placing the valid ones in the output list, rather than stage them in an array in the interim.

But if the customer looks at the result and says, "By the way, did I mention that I want the email list in alphabetical order," then you are way ahead having the array all ready to sort.

Cory · August 10, 2023

2 hours ago, acantor said:

I should have explicitly said, when setting the challenge, that a small error rate is acceptable!

Yup. My clients usually have an idea about something and want X. But they don't really understand the nature of X. So really they want Y. Y = their flawed concept of X. A good example is when Apple made the iPod: beta users wanted a random shuffle mode. When they tested it they reported a flaw in the feature because sometimes a song repeated 2 or many times.... Ummmm... That is the nature of random. Even with 60 songs one will hear some repeated as many as 5 times. So they changed it to the consumer's idea of random.

I love the quote about AI generation art, or code. I paraphrase "AI requires a clear and considered definition of the requirements for the product from the customer.... I think we're safe." LOL.

acantor · August 10, 2023

Quote

But then I didn't want to worry about how big to make the array, so I went directly from separating the strings, to placing the valid ones in the output list, rather than stage them in an array in the interim.

I get it. The fact the size of the array can't be known in advance bugs me. One of my decisions, after I committed to using an array, was to pick a size that will usually work. I chose 999,999! Very kludgey...

Quote

But if the customer looks at the result and says, "By the way, did I mention that I want the email list in alphabetical order," then you are way ahead having the array all ready to sort.

I don't know how to do that! I may have learned how to sort an array in a computer science course I took about a million years ago, which I failed. My grade was something like 36%!

This afternoon, I used my macro "for real" for the first time. I needed to copy email addresses from a horrible and inaccessible web-based email client. I pressed Ctrl + A to select the entire web page and Ctrl + C to copy it to the clipboard. I triggered the macro. It worked!

Cory · August 10, 2023

2 hours ago, acantor said:

I triggered the macro. It worked!

Congratulations

rberq · August 11, 2023

4 hours ago, acantor said:

I may have learned how to sort an array in a computer science course I took about a million years ago, which I failed. My grade was something like 36%!

Bubble sort an array

//
// Sort array of process names
Get Array Length (%ProcessNames%) => %arraylength%
Variable Set Integer %sortindex1% to 0
Variable Set Integer %sortindex2% to 0
Variable Set Integer %sortlimit% to %arraylength%
Variable Modify Integer %sortlimit%: Decrement
Repeat Until %sortindex1% Equals "%sortlimit%"
  Variable Modify Integer %sortindex1%: Increment
  Variable Modify Integer: %sortindex2% = %sortindex1% + 1
  If Variable %ProcessNames[%sortindex1%]% Equals ""
    Repeat Exit
  End If
  Repeat Until %sortindex2% Is Greater Than "%arraylength%"
    If Variable %ProcessNames[%sortindex2%]% Equals ""
      Repeat Exit
    End If
    If Variable %ProcessNames[%sortindex1%]% Is Greater Than "%ProcessNames[%sortindex2%]%"
      Variable Modify String: Copy Text from %ProcessNames[%sortindex1%]% to %tempname%
      Variable Modify String: Copy Text from %ProcessNames[%sortindex2%]% to %ProcessNames[%sortindex1%]%
      Variable Modify String: Copy Text from %tempname% to %ProcessNames[%sortindex2%]%
    End If
    Variable Modify Integer %sortindex2%: Increment
  End Repeat
End Repeat
//
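
For readers who find the nested loops easier to follow in a general-purpose language, here is a minimal Python sketch of the same compare-and-swap idea. Python is not what the macro uses, and the function name `exchange_sort` is made up for this illustration. It is a sketch of the logic only, not a translation of every macro command (for instance, it has no need for the empty-element checks).

```python
def exchange_sort(names):
    """Sort a list in place using the same nested-loop comparison as the macro."""
    n = len(names)
    for i in range(n - 1):           # outer index: the position being settled
        for j in range(i + 1, n):    # inner index: every element after position i
            if names[i] > names[j]:  # out of order, so swap the two elements
                names[i], names[j] = names[j], names[i]
    return names


emails = ["zzz@abc.com", "yyy@xyz.com", "xxx@123.com"]
print(exchange_sort(emails))
```

In everyday Python one would simply call the built-in `sorted(emails)`; the explicit loops are shown only to mirror the macro.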

Cory · August 11, 2023

Good job. I was going to look for mine, but didn't have time.

acantor · August 11, 2023

rberq, your bubble sort code helps me understand why I flunked a computer science course. I think I slept through the lessons on sorting!

(Or, perhaps I failed because students couldn't access a computer that ran Pascal, and the entire course was about programming in Pascal!)

So lacking the skills to develop any kind of Pascal program, and in admiration of your example of an array sort using Macro Express, here's my way, using Macro Express to alphabetize email addresses by way of MS-DOS. 🤣

Variable Set String %Emails% to "zzz@abc.com
yyy@xyz.com
xxx@123.com" // Unsorted email addresses
 
Variable Set String %FileStart% to "c:\tmp\Sort1.txt" // Unsorted email addresses to be saved to this file
Variable Set String %FileEnd% to "c:\tmp\Sort2.txt" // Sorted email addresses to be saved in this file
 
Variable Modify String: Save %Emails% to "%FileStart%"
 
Program Launch: "cmd" (Normal)
Parameters:  // Start a DOS session...
 
Text Type (Simulate Keystrokes): sort %FileStart% > %FileEnd%<ENTER> // Output sort instructions to the command line
Text Type (Simulate Keystrokes): exit<ENTER> // Exit the DOS session
 
Delay: 1000 milliseconds
Variable Set String set %EmailsSorted% to the contents of %FileEnd%
 
Text Box Display: Sorted Email Messages

<VARIABLE SET STRING Option="\x00" Destination="%Emails%" Value="zzz@abc.com\r\nyyy@xyz.com\r\nxxx@123.com" NoEmbeddedVars="FALSE" _COMMENT="Unsorted email addresses"/>
<COMMENT/>
<VARIABLE SET STRING Option="\x00" Destination="%FileStart%" Value="c:\\tmp\\Sort1.txt" NoEmbeddedVars="FALSE" _COMMENT="Unsorted email addresses to be saved to this file"/>
<VARIABLE SET STRING Option="\x00" Destination="%FileEnd%" Value="c:\\tmp\\Sort2.txt" NoEmbeddedVars="FALSE" _COMMENT="Sorted email addresses to be saved in this file"/>
<COMMENT/>
<VARIABLE MODIFY STRING Option="\x11" Destination="%Emails%" Filename="%FileStart%" Strip="FALSE" NoEmbeddedVars="FALSE"/>
<COMMENT/>
<PROGRAM LAUNCH Path="cmd" Mode="\x00" Default_Path="TRUE" Wait="1" Get_Console="FALSE" _COMMENT="Start a DOS session..."/>
<COMMENT/>
<TEXT TYPE Action="0" Text="sort %FileStart% > %FileEnd%<ENTER>" _COMMENT="Output sort instructions to the command line"/>
<TEXT TYPE Action="0" Text="exit<ENTER>" _COMMENT="Exit the DOS session"/>
<COMMENT/>
<DELAY Flags="\x02" Time="1000"/>
<VARIABLE SET STRING Option="\x03" Destination="%EmailsSorted%" Filename="%FileEnd%" Strip="FALSE" NoEmbeddedVars="FALSE"/>
<COMMENT/>
<TEXT BOX DISPLAY Title="Sorted Email Messages" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033{\\fonttbl{\\f0\\fnil\\fcharset0 Tahoma;}{\\f1\\fnil Tahoma;}}\r\n\\viewkind4\\uc1\\pard\\lang4105\\f0\\fs20 %EmailsSorted%\\lang1033\\f1\\fs14 \r\n\\par }\r\n" Left="821" Top="417" Width="317" Height="375" Monitor="0" OnTop="TRUE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/>

rberq · August 12, 2023

Clever. Makes sense to use tools that already exist. But I had more fun. 🙂

acantor · August 12, 2023

In my nostalgic and geeky world, scripting a macro that causes a DOS prompt to appear is fun, too! I can't recall the last time I shelled out to a DOS prompt to actually get something done. 1998, maybe??

acantor · August 13, 2023

I got the script down to 19 lines, although the logic may not be as easy to follow as before. (The former IF-THEN-ELSE is now IF-THEN.)

I tried to compensate for the reduction in transparency by reorganizing the script and revising some comments.

Variable Set String %ValidChars% to "abcdefghijklmnopqrstuvwxyz1234567890-_@." // Every valid character in an email address
 
Variable Set String %Clip% from the clipboard contents
Variable Set Integer %ClipLength% to the length of variable %Clip%
 
Repeat Start (Repeat %ClipLength% times) // Parse input, one character at a time
  Variable Modify String: Copy part of text in %Clip% starting at %Count% and 1 characters long to %Char%
  If Variable %ValidChars% Does not Contain "%Char%" // %Char% is NOT valid in an email address
    Variable Set String %Char% to "*" // Substitute * for the invalid character
    Variable Modify Integer %StarCount%: Increment // Keep track of the number of invalid characters
  End If
  Variable Set String %Result% to "%Result%%Char%" // Append %Char% to %Result%
End Repeat
 
Split String "%Result%" on "*" into %PossibleEmail%, starting at 1
 
Variable Modify Integer: %StarCount% = %StarCount% + 1 // Number of possible email addresses to check for "@"
 
Repeat Start (Repeat %StarCount% times)
  If Variable %PossibleEmail[%Count%]% Contains "@" // Assume an "@" means a string is part of an email address
    Variable Set String %EmailList% to "%EmailList%
%PossibleEmail[%Count%]%" // Create a list of email addresses
  End If
End Repeat
 
Text Box Display: Scraped Email Addresses

<VARIABLE SET STRING Option="\x00" Destination="%ValidChars%" Value="abcdefghijklmnopqrstuvwxyz1234567890-_@." NoEmbeddedVars="FALSE" _COMMENT="Every valid character in an email address"/>
<COMMENT/>
<VARIABLE SET STRING Option="\x02" Destination="%Clip%" NoEmbeddedVars="FALSE"/>
<VARIABLE SET INTEGER Option="\x0D" Destination="%ClipLength%" Text_Variable="%Clip%"/>
<COMMENT/>
<REPEAT START Start="1" Step="1" Count="%ClipLength%" Save="TRUE" Variable="%Count%" _COMMENT="Parse input, one character at a time"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Char%" Variable="%Clip%" Start="%Count%" Count="1" NoEmbeddedVars="FALSE"/>
<IF VARIABLE Variable="%ValidChars%" Condition="\x07" Value="%Char%" IgnoreCase="TRUE" _COMMENT="%Char% is NOT valid in an email address"/>
<VARIABLE SET STRING Option="\x00" Destination="%Char%" Value="*" NoEmbeddedVars="FALSE" _COMMENT="Substitute * for the invalid character"/>
<VARIABLE MODIFY INTEGER Option="\x07" Destination="%StarCount%" _COMMENT="Keep track of the number of invalid characters"/>
<END IF/>
<VARIABLE SET STRING Option="\x00" Destination="%Result%" Value="%Result%%Char%" NoEmbeddedVars="FALSE" _COMMENT="Append %Char% to %Result%"/>
<END REPEAT/>
<COMMENT/>
<SPLIT STRING Source="%Result%" SplitChar="*" Dest="%PossibleEmail%" Index="1"/>
<COMMENT/>
<VARIABLE MODIFY INTEGER Option="\x00" Destination="%StarCount%" Value1="%StarCount%" Value2="1" _COMMENT="Number of possible email addresses to check for \"@\""/>
<COMMENT/>
<REPEAT START Start="1" Step="1" Count="%StarCount%" Save="TRUE" Variable="%Count%"/>
<IF VARIABLE Variable="%PossibleEmail[%Count%]%" Condition="\x06" Value="@" IgnoreCase="FALSE" _COMMENT="Assume an \"@\" means a string is part of an email address"/>
<VARIABLE SET STRING Option="\x00" Destination="%EmailList%" Value="%EmailList%\r\n%PossibleEmail[%Count%]%" NoEmbeddedVars="FALSE" _COMMENT="Create a list of email addresses"/>
<END IF/>
<END REPEAT/>
<COMMENT/>
<TEXT BOX DISPLAY Title="Scraped Email Addresses" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033{\\fonttbl{\\f0\\fnil\\fcharset0 Tahoma;}{\\f1\\fnil Tahoma;}}\r\n\\viewkind4\\uc1\\pard\\lang4105\\f0\\fs20 %EmailList%\\lang1033\\f1\\fs14 \r\n\\par }\r\n" Left="645" Top="19" Width="664" Height="976" Monitor="0" OnTop="TRUE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/>

pferris · September 14, 2023

Cory,

I'm with ya... just detecting email address 'valid format' is fairly straightforward, but checking validity is something else.

Pythonesque regex - IIRC - might be something like [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+ But thinking about it, even that is way primitive. And, as you aptly pointed out - that's not even minding TLD's or subs, etc.; just something that pretends to pass the sniff test. I'd probably extract them off into a file then try to validate. But I don't see even the beginning of it happening in ~60 lines of MEP code.
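
As a rough illustration of the pattern quoted above, here is a minimal Python sketch that runs it against the sample text from the original challenge. The function name `find_emails` is hypothetical. As the post says, the pattern only checks the shape of an address, and its final character class will also swallow a trailing period at the end of a sentence.

```python
import re

# Pattern as quoted in the post above; it tests format only, not deliverability.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")


def find_emails(text):
    """Return every substring of text that matches the pattern, in order."""
    return EMAIL_PATTERN.findall(text)


sample = "Blb bla bla xx@yy.com bla bla\nbla aa@bb.info bla bla bla"
print(find_emails(sample))  # ['xx@yy.com', 'aa@bb.info']
```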

In the old days (of the 80's), I probably could've even gotten frisky and validated things by a telnet connection to port 25 before actually sending the message. I don't know about these days - I expect it'd be blocked to avoid the spammer relays, etc. But then again, who knows? I'd have expected the MGM Grand not to be a victim of a ransomware attack, but here we are.

I don't do much regex or Python anymore (retired from IT 2 years ago) other than for my own pleasure, and I'm still knocking the rust off of my MEP skills (I used to be fairly sharp, but it's definitely a "use it or lose it" thing with me!). But trying to do a serious email address extraction WITHOUT regex just doesn't sound like fun (because we all know how much fun regex is, right?). As was alluded to, to me this would be like "Build me a house, but you can't use anything from that DEWALT, Makita, or Milwaukee stocked tool trailer of yours. Just this coping saw, a box of nails and this slightly used 1000 grit sandpaper I found blowing down the street." I haven't played with it (yet?), but I think I heard that Excel can incorporate Python now (anyone know if that's current or just coming?). If that's true, I could envision a viable solution there more easily than in MEP. I probably would pass if this challenge were presented to me by a potential client.

acantor · September 20, 2023

When I set this challenge, I didn't anticipate the issues that folks have raised. I've found these discussions interesting and illuminating.

I acknowledge that RegEx is the way to go if the goal is to extract valid email addresses only. On the other hand, here is the original background to the challenge:

Quote

I often need to extract email addresses that appear in documents, spreadsheets, email messages, and webpages. I used to do this manually, but recently, I realized I should be using a macro to do the heavy lifting... at least most of the heavy lifting.

When I wrote the above, I was thinking about validity mostly in terms of deciding whether a string has the appearance of an email address. So my code does two things: it checks whether a string contains valid email address characters:

abcdefghijklmnopqrstuvwxyz1234567890-_@.

Then the code checks whether a string contains the at-sign. If yes, I assume the string is an email address.
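
The two steps can be sketched outside Macro Express as well. The following is a minimal Python sketch of the same substitute, split, and filter logic, assuming the case-insensitive comparison set in the macro's XML (`IgnoreCase="TRUE"`). The names `VALID_CHARS` and `scrape_emails` are made up for this illustration.

```python
# Same character set the macro treats as valid in an email address.
VALID_CHARS = "abcdefghijklmnopqrstuvwxyz1234567890-_@."


def scrape_emails(text):
    """Return the pieces of text that look like email addresses."""
    # Step 1: replace every character that is not in VALID_CHARS with "*".
    masked = "".join(ch if ch.lower() in VALID_CHARS else "*" for ch in text)
    # Step 2: split on "*" and keep any piece that contains an at-sign.
    return [piece for piece in masked.split("*") if "@" in piece]


sample = "Blb bla bla xx@yy.com bla bla\nbla aa@bb.info bla bla bla"
print(scrape_emails(sample))            # ['xx@yy.com', 'aa@bb.info']
print(scrape_emails("See you @ noon"))  # ['@']
```

The second call shows the known weakness: any fragment containing an at-sign is accepted.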

Because of these discussions, I realize my solution is not a good general solution. My code extracts invalid email addresses like these:

@hello

hello@

...@...

.com@hello

hello@.com

hello@hello@hello.com

But for the specific challenge, I think my simple test is adequate. That's because the email addresses contained in my documents, spreadsheets, and email messages are valid -- they are email addresses I already use! But had the challenge been to harvest email addresses gathered from the wild, my solution is meh!

rberq · September 20, 2023

Close enough for gummint work. And if you send to a few invalid addresses, what's the harm? The mail will just go into the big bit bucket in the clouds and be flushed out with the next rain storm.

acantor · September 20, 2023

Quote

Close enough for gummint work.

Agreed.

Adding a handful of extra tests would weed out some invalid addresses. But then the macro would be a lot longer than 19 lines!

Here are examples of simple tests to filter out invalid email addresses:

If the length < 5... [the shortest possible email must contain at least five characters: x@x.x]

If the number of at-signs is not equal to 1... [I think there can only be one @]

If the first character is an at-sign...

If the last character is an at-sign...

If the number of periods is zero... [I don't think an email address can live without a period]

etc.

Tests like these will filter out duds, but for me, it would not be worth the effort because my source documents normally contain only valid email addresses.
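
For anyone who does want the extra filtering, the tests listed above can be sketched in a few lines. This is a minimal Python sketch, not part of the 19-line macro; the function name `looks_like_email` is hypothetical, and the length check uses 5 to match the x@x.x example.

```python
def looks_like_email(candidate):
    """Apply the simple tests from the list above to one candidate string."""
    if len(candidate) < 5:             # shorter than x@x.x
        return False
    if candidate.count("@") != 1:      # zero at-signs, or more than one
        return False
    if candidate.startswith("@") or candidate.endswith("@"):
        return False                   # at-sign is the first or last character
    if "." not in candidate:           # no period anywhere
        return False
    return True


for text in ["xx@yy.com", "@hello", "hello@", "hello@hello@hello.com", "hello@.com"]:
    print(text, looks_like_email(text))
```

These tests reject @hello, hello@, and hello@hello@hello.com, but strings such as hello@.com and .com@hello still pass, so more tests would be needed to catch every dud.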

BTW, I use the macro regularly. A client communicates to me via a quirky web-based email application. When using this application, it's difficult to select and copy email addresses from messages via keyboard or mouse, which I need to paste into Outlook. So I select the entire page (Ctrl + A), copy (Ctrl + C), and trigger the macro. It takes the macro a second or two to chew through the text and deliver a list of email addresses.
