stevecasper Posted September 28, 2009

Ok, I love MEP as compared to ME3. However, I just discovered a HUGE problem with MEP and was wondering if anybody else has come across a similar discrepancy. Cory, you'll be familiar with the general idea of this macro, as it is the one we PMed about last week.

Scenario: At work we use ME3 (except for me, because I'm an angry rebel). My boss's boss requested a macro last week that performs a very simple but necessary function. Here's the gist: We have a .csv file 550 lines long (titled "Mule"). We also have a .txt file generated 5 times throughout the day. The .txt file ranges anywhere from several hundred KB to several hundred MB in size. The macro needs to grab the account number from the comma-delimited .csv and check the .txt file for occurrences of that account number. Only one hit is required, and if the account number does appear, the entire line from the Mule file needs to be written to a separate .txt file.

After much tribulation (thanks in no small part to the sheer crappiness of my PC), I was finally able to make a macro that works wonderfully well in ME3. Today, the bosses told me that they would like me to be the guy who runs this macro throughout the day to gather the necessary information. Not a problem. The largest file so far was 220 MB, and it ran in about 3 minutes.

However, since I'm a snob and like the features of MEP, I decided to convert the macro to MEP and "improve" it for my own pleasure. My changes consist of adding a counter and utilizing the "Update Textbox" feature so I can see roughly how far along the macro has progressed.

The first thing I noticed (during a test using a 300 KB .txt file): running the macro in ME3 was just a little faster than MEP without any changes. Not enough to really matter, but noticeable.
The second thing I noticed (still using the 300 KB file): adding the "Update Textbox" feature slows the macro down a little more. This I expected, so it doesn't bother me; the tradeoff of viewing the progress is worth the insignificant slow-down.

The third thing, and this is the killer: I ran the fully MEP-tricked-out macro against a previously processed 31 MB file (clocked at 32 seconds to process the entire file in ME3, or just over 17 Mule lines checked per second), and the macro stalled. Or at least, I thought it stalled. The text box sat on "Items processed: 0" for so long I thought the macro had broken. Then it changed to "1". I waited... eventually it changed to "2". At this rate it was going to take an hour and a half to process the entire file: roughly a 17,000% increase over ME3. I didn't dare test it against the 220 MB file.

Has anybody else noticed this? Does anybody else think that this is a highly unacceptable discrepancy? Here is the code for the ME3 and MEP macros. I can't provide examples of the actual files, since it's all very sensitive bank account information... not to mention the fact that the .txt files are gargantuan. Please notice that the changes between the two macros are really very superficial: I renamed variables for MEP-style convenience and added just a couple of commands for the updater.
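For readers who don't speak Macro Express, the core logic of the macro boils down to something like this Python sketch (the data and field positions here are made up for illustration; the real thing is Macro Express script, not Python):

```python
def find_matches(mule_rows, big_text):
    """Return output lines for Mule rows whose account number
    (column 4 of the CSV) appears anywhere in the large report text.
    A trailing space is appended to the account number, mirroring the
    macro, so a match cannot continue into a longer number."""
    matches = []
    for row in mule_rows:
        account = row[3] + " "          # column 4, plus trailing space
        if account in big_text:
            # keep columns 3-6, comma-joined, like the macro's output line
            matches.append(", ".join(row[2:6]))
    return matches

# Tiny self-contained demo with made-up data:
mule = [
    ["x", "y", "Doe, John", "90059465895", "RT1", "note"],
    ["x", "y", "Doe, Jane", "5226262984", "RT2", "note"],
]
report = "...900058188846 90059465895 John F. Doe..."
hits = find_matches(mule, report)
```

The `If Variable Contains` step in both macros corresponds to the `account in big_text` test above, which is where the thread eventually localizes the slowdown.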
ME3 Script:

Variable Set String %T14% "G:\Loss Prevention\ACH\Mule Accounts.csv"
Text Box Display: Find the Source
Variable Set String %T10% from File Name
Text Box Close: Find the Source
Variable Set String %T15% "%T10%"
Replace ".txt" with " " in %T15%
Variable Set String %T12% "%T15% Match File.txt"
Variable Modify String: Save %T13% to Text File
Date/Time: Save "h:mm:ss AMPM" into %T20%
Text Box Display: ACH Checker
Variable Set String %T11% from File: "%T10%"
ASCII File Begin Process: "%T14%" (Comma Delimited Text )
Variable Set String %T4% "%T4% "
If Variable %T11% contains variable %T4%
Variable Set String %T7% "%T3%, %T4%, %T5%, %T6%"
Variable Modify String: Append %T7% to Text File
Variable Modify Integer: Inc (%N1%)
End If
ASCII File End Process
Text Box Close: ACH Checker
Date/Time: Save "h:mm:ss AMPM" into %T21%
Text Box Display: Finished

ME3 DE:

<TVAR2:14:01:G:\Loss Prevention\ACH\Mule Accounts.csv><TBOX4:F:6:000469000050000290000083:000:Find the SourceBrowse to the Source ACH file.><TVAR2:10:09:SOURCE File><TBCLOSE:Find the Source><TVAR2:15:01:%T10%><TMVAR2:21:15:01:001:000:.txt ><TVAR2:12:01:%T15% Match File.txt><TMVAR2:17:13:00:000:000:%T12%F><DT:h:mm:ss AMPMT:20:1:><TBOX4:F:6:000469000050000290000106:000:ACH CheckerThe macro is running. Started: %T20%><TVAR2:11:04:%T10%><ADFBEG:F10:001:000002:000000:%T14%><TVAR2:04:01:%T4% ><IFVAR2:4:11:7:T4><TVAR2:07:01:%T3%, %T4%, %T5%, %T6%><TMVAR2:20:07:00:000:000:%T12%T><NMVAR:08:01:0:0000001:0:0000000><ENDIF><ADFEND><TBCLOSE:ACH Checker><DT:h:mm:ss AMPMT:21:1:><TBOX4:T:1:000570Center000278000200:000:FinishedThe Checker has finished.
Matches found: %N1% Started - %T20% Finished - %T21%>

MEP Script:

Variable Set String %MuleLocation% to "G:\Loss Prevention\ACH\Mule Accounts.csv"
Text Box Display: Find the Source
Variable Set String %ACHSourceFile[1]%: Prompt for a filename
Text Box Close: Find the Source
Variable Set String %ACHSourceFile[2]% to "%ACHSourceFile[1]%"
Variable Modify String: Replace ".txt" in %ACHSourceFile[2]% with " "
Variable Set String %ACHMatchName% to "%ACHSourceFile[2]% Match File.txt"
Variable Modify String: Save %ACHMatchFile% to "%ACHMatchName%"
Date/Time: Set %Start% to the current date/time using "h:mm:ss AMPM" as the format
Text Box Display: ACH Checker
Variable Set String %ACHTextContents% to the contents of %ACHSourceFile[1]%
ASCII File Begin Process: "%MuleLocation%" (Comma Delimited Text (.csv))
Variable Modify Integer %Processed%: Increment
Variable Set String %T[4]% to "%T[4]% "
If Variable %ACHTextContents% Contains "%T[4]%"
Variable Set String %MatchedClientInfo% to "%T[3]%, %T[4]%, %T[5]%, %T[6]%"
Variable Modify String: Append %MatchedClientInfo% to text file, "%ACHMatchName%"
Variable Modify Integer %MatchIncrement%: Increment
End If
Update Textbox: ACH Checker
ASCII File End Process
Text Box Close: ACH Checker
Date/Time: Set %Finish% to the current date/time using "h:mm:ss AMPM" as the format
Text Box Display: Finished

MEP DE:

<VARIABLE SET STRING Option="\x00" Destination="%MuleLocation%" Value="G:\\Loss Prevention\\ACH\\Mule Accounts.csv"/> <TEXT BOX DISPLAY Title="Find the Source" Content="{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;}\r\n\\viewkind4\\uc1\\pard\\cf1\\lang1033\\b\\f0\\fs24 Browse to the Source ACH file.\\cf0\\b0\\fs16 \r\n\\par }\r\n" Left="469" Top="50" Width="295" Height="73" Monitor="0" OnTop="TRUE" Keep_Focus="FALSE" Mode="\x02" Delay="0"/> <VARIABLE SET STRING Option="\x08" Destination="%ACHSourceFile[1]%" Value="Browse to the ACH SOURCE .txt File"/> <TEXT BOX CLOSE Header="Find the
Source"/> <VARIABLE SET STRING Option="\x00" Destination="%ACHSourceFile[2]%" Value="%ACHSourceFile[1]%"/> <VARIABLE MODIFY STRING Option="\x0F" Destination="%ACHSourceFile[2]%" ToReplace=".txt" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE"/> <VARIABLE SET STRING Option="\x00" Destination="%ACHMatchName%" Value="%ACHSourceFile[2]% Match File.txt"/> <VARIABLE MODIFY STRING Option="\x11" Destination="%ACHMatchFile%" Filename="%ACHMatchName%" CRLF="FALSE"/> <DATE/TIME Format="h:mm:ss AMPM" Flags="\x80" Date="12/30/1899" Day_Offset="0" Month_Offset="0" Year_Offset="0" Hour_Offset="0" Second_Offset="0" Left="Center" Top="Center" Monitor="0" Variable="%Start%" IsDateVar="FALSE" _IGNORE="0x0002"/> <TEXT BOX DISPLAY Title="ACH Checker" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The macro is running.\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par \\pard\\qc\\f1 Items processed so far:\\i \\cf2 %Processed%\\i0\\f0 \r\n\\par \\pard\\cf0 \r\n\\par \\pard\\qc Started: \\b %Start%\r\n\\par }\r\n" Left="469" Top="50" Width="290" Height="128" Monitor="0" OnTop="TRUE" Keep_Focus="FALSE" Mode="\x02" Delay="0"/> <VARIABLE SET STRING Option="\x03" Destination="%ACHTextContents%" Filename="%ACHSourceFile[1]%" Strip="FALSE"/> <ASCII FILE BEGIN PROCESS Filename="%MuleLocation%" Format="CSV" Start_Record="2" Process_All="TRUE" Records="0" Variable="%T%" Start_Index="1"/> <VARIABLE MODIFY INTEGER Option="\x07" Destination="%Processed%"/> <VARIABLE SET STRING Option="\x00" Destination="%T[4]%" Value="%T[4]% "/> <IF VARIABLE Variable="%ACHTextContents%" Condition="\x06" Value="%T[4]%" IgnoreCase="FALSE"/> <VARIABLE SET STRING Option="\x00" Destination="%MatchedClientInfo%" Value="%T[3]%, %T[4]%, %T[5]%, %T[6]%"/> <VARIABLE MODIFY STRING Option="\x12" Destination="%MatchedClientInfo%" Filename="%ACHMatchName%" 
CRLF="TRUE"/> <VARIABLE MODIFY INTEGER Option="\x07" Destination="%MatchIncrement%"/> <END IF/> <UPDATE TEXTBOX Header="ACH Checker" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The macro is running.\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par \\pard\\qc\\f1 Items processed so far:\\i \\cf2 %Processed%\\i0\\f0 \r\n\\par \\pard\\cf0 \r\n\\par \\pard\\qc Started: \\b %Start%\r\n\\par }\r\n"/> <ASCII FILE END PROCESS/> <TEXT BOX CLOSE Header="ACH Checker"/> <DATE/TIME Format="h:mm:ss AMPM" Flags="\x80" Date="12/30/1899" Day_Offset="0" Month_Offset="0" Year_Offset="0" Hour_Offset="0" Second_Offset="0" Left="Center" Top="Center" Monitor="0" Variable="%Finish%" IsDateVar="FALSE" _IGNORE="0x0002"/> <TEXT BOX DISPLAY Title="Finished" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;\\red255\\green0\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The Checker has finished.\r\n\\par \\cf0\\b0\\f1\\fs16 Items processed: \\cf2 %Processed%\\cf0\\f0 \r\n\\par Matches found: \\cf3\\b\\fs24 %MatchIncrement%\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par Started - %Start%\r\n\\par Finished - %Finish%\r\n\\par }\r\n" Left="570" Top="Center" Width="278" Height="200" Monitor="0" OnTop="FALSE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/>
Cory Posted September 28, 2009

I'm happy to see there is someone as long-winded as me on this forum! Before you waste too much more of anyone's time, how about you isolate exactly what is causing the slowdown? You were a beta tester, right? Back then we identified many things that ran very slowly in MEP, and most of those things were fixed, but not all. If I remember, one of the big killers was creating output variables which grew. For some reason the way MEP increases the memory size is a real slowdown, and in cases where the results were going to be written to file, I found it faster to output them one at a time to the text file instead of accumulating them in an output variable.

Anyway, when I was running into nearly exactly the same problem you did, I wrote simple 'dyno' macros that would do some operation a gazillion times and calculate the average time. I'm thinking you need to do something similar here. Most of the macros I had like yours I rewrote to take advantage of new capabilities. One huge benefit is the array variables, and those might help a lot here.

For progress indicators I sometimes calculate the percentage instead of "Record X of Y". On really big jobs this helps keep things quick, because I only update the text box if the percentage changes. That way if I have 10,000 records to process, I perform 1% of the updates to the text box that I would if I reported each record.

Anyway, find out where the slowdown is, then let's see what we can do for you from there.
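Cory's 'dyno' idea, timing one operation repeated many times to get an average cost per call, might look like this in Python (a sketch only; his real dynos are Macro Express macros, and the operation here is a stand-in):

```python
import time

def dyno(operation, iterations=1_000):
    """Time one operation repeated many times and report the average
    cost per call: the engine-dynamometer analogy from the post."""
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / iterations

# Example: how expensive is a substring search over a ~1 MB string?
haystack = "x" * 1_000_000
avg = dyno(lambda: "needle" in haystack, iterations=1_000)
```

Running separate dynos for `If Contains`, the file-process loop, and the textbox update would pinpoint which command carries the cost, which is exactly the isolation step the thread goes through manually below.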
stevecasper Posted September 28, 2009 (Author)

Quoting Cory: "Before you waste too much more of anyone's time, how about you isolate exactly what is causing the slowdown? You were a beta tester, right?"

Cory, I'm flattered that you think I was a beta tester, because frankly, I think that honor is reserved for folks who have shown ISS that they know generally what is going on. So thank you. I'm convinced that the slowdown is occurring during the "If Variable Contains" command.
I don't have any quantifiable evidence supporting my conviction, though, so I'm creating a 30 MB text file (on my home computer, now) in an attempt to duplicate my problem and maybe isolate exactly where it is.

You mention "dyno" macros, and based on my recollection of some of your previous posts, I believe you use that term to refer to sub-macros called by the main macro. Is that correct? I'm curious how you would implement such a macro to perform the function I'm striving for here. Also, I like the idea of using a percentage complete rather than an exact number; though it will take a little extra code, I can see how it would benefit my macro by not updating the dialog box on every iteration.

My PC has finally completed the production of a 33.7 MB text file full of faux account numbers and made-up names. I'll keep you posted on my results.
stevecasper Posted September 28, 2009 (Author)

5 minutes 36 seconds. Decidedly slower than the ME3 version... but remarkably faster than on my work PC. Of course the processing power of my home computer dwarfs the cybernetic dreams of my work PC... but I'm still not impressed with the 5 minute 34 second increase over the ME3 design for this macro. Even at this much faster rate, the 220 MB file would take a half-hour, rather than the 3 or so minutes it takes in ME3... and I have to assume there would be a progressive increase in the processing time as the .txt gets bigger... so I could still be looking at 1.5 hours or more.

-----

I modified the code to create a percentage-complete variable and adjusted the update dialog so it only changes if the percent complete changes. Results: 5 minutes 36 seconds. Doing it this way made no difference to the actual speed at which the macro processed the file. It's possible that my code was not perfect, as I've never done this before... however, the code is good. I don't imagine a streamlined version of this code would make a world of difference.
Here it is, in case you're curious:

<VARIABLE MODIFY INTEGER Option="\x05" Destination="%N[1]%" Variable="%D[1]%"/> <VARIABLE MODIFY DECIMAL Option="\x03" Destination="%D[2]%" Value1="%D[1]%" Value2="549"/> <VARIABLE MODIFY DECIMAL Option="\x04" Destination="%D[2]%" Places="2"/> <VARIABLE MODIFY DECIMAL Option="\x02" Destination="%D[2]%" Value1="%D[2]%" Value2="100"/> <VARIABLE MODIFY DECIMAL Option="\x06" Destination="%D[2]%" Variable="%N[3]%"/> <IF VARIABLE Variable="%N[3]%" Condition="\x01" Value="%N[4]%" IgnoreCase="FALSE"/> <UPDATE TEXTBOX Header="ACH Macro Running" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033{\\fonttbl{\\f0\\fnil\\fcharset0 Tahoma;}{\\f1\\fnil Tahoma;}}\r\n\\viewkind4\\uc1\\pard\\f0\\fs16 The macro is running.\r\n\\par \r\n\\par Start time: %Start%\r\n\\par \r\n\\par Percentage of Items Processed: %N[3]%\r\n\\par Matches Found: %N[2]%\\f1 \r\n\\par }\r\n"/> <VARIABLE SET INTEGER Option="\x00" Destination="%N[4]%" Value="%N[3]%"/> <END IF/>

-----

I have just disabled two lines of necessary code to try to isolate the slowness. I disabled the Set String command (wherein Ts 3-6 were saved to %MatchedClientInfo%) and also the Variable Modify command where %MatchedClientInfo% was appended to the text file. I did this with the vague hope that the act of appending the text was what was slowing things down (though I knew it wasn't, since only a small percentage of account numbers will actually be matched; if this had been the problem, only those particular cases would have slowed the macro down). As it is, I'm down to two possibilities: the file-processing loop is slow, or the "If Contains" is performing slowly.

-----

I just disabled the file process and replaced it with a basic "Repeat Loop" (creating dummy T3-T6). Result: SLOW as usual. I then re-enabled the file process and disabled the "If Contains" sequence. The macro smoked through in a heartbeat.
I now have evidence supporting my conviction, and no idea how to make that particular function move faster. Hence, I must stand beside my "MEP FAIL" statement for the time being.
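The percentage-gated update tried above (only touch the textbox when the whole-percent value changes) can be modeled like this in Python (a rough sketch, not the actual macro; the callbacks are stand-ins for the per-record work and the textbox update):

```python
def process_with_progress(records, handle, report):
    """Only report progress when the whole-percent value changes, so a
    1,000-record run triggers ~100 textbox updates instead of 1,000."""
    total = len(records)
    last_pct = -1
    updates = []                      # record of reported percentages
    for i, rec in enumerate(records, start=1):
        handle(rec)                   # the real per-record work
        pct = (i * 100) // total
        if pct != last_pct:           # gate: skip redundant updates
            report(pct)
            updates.append(pct)
            last_pct = pct
    return updates

updates = process_with_progress(list(range(1000)),
                                lambda rec: None,
                                lambda pct: None)
```

As the post found, this only helps when the update itself is the bottleneck; here the per-record `If Contains` dominated, so gating the updates changed nothing.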
rberq Posted September 28, 2009

"... disabled the 'If contains' sequence. The macro smoked through in a heart-beat"

That was going to be my guess, too, as to where you would find the delay. Perhaps you could take some timings, ME3 vs. MEP, and forward them to ISS... though I suspect someone there has already seen this posting and taken note of your findings. I think you have narrowed it down enough that ISS can zero in on the scan logic and see where they went wrong between ME3 and MEP.

It would be interesting to know whether the length of your match string makes a difference in how fast it runs. You could, for example, scan the input string for only the first character of your match string. When found, do a straight compare of the whole match string starting at the "hit" position of the single character. Repeat until the input string is exhausted. I'd be curious whether that would run faster than scanning for the whole match string. It might, if MEP uses different logic to find a single character vs. a string.
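rberq's first-character prescan could be modeled like this (Python sketch; whether it beats MEP's built-in search is exactly the open question he raises):

```python
def scan_first_char(text, needle):
    """Hop through the text looking only for the first character of
    the match string, then do a direct full-string comparison at each
    hit position. Returns the match position, or -1 if none."""
    first = needle[0]
    pos = text.find(first)
    while pos != -1:
        if text[pos:pos + len(needle)] == needle:
            return pos               # full match at this position
        pos = text.find(first, pos + 1)
    return -1

pos = scan_first_char("abc 90059465895 xyz", "90059465895")
```

For account numbers this prescan hits often (every digit 0-9 is a candidate), so the per-hit full comparison matters; it pays off mainly when the first character is rare in the text.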
stevecasper Posted September 29, 2009 (Author)

Quoting rberq: "Perhaps you could take some timings, ME3 vs. MEP, and forward to ISS -- though I suspect someone there has already seen this posting and taken note of your findings."

As a matter of fact, I just checked my email and found this little beauty from our good friends at ISS:

This issue has been assigned the tracking number [iSS7518]

I was considering filing a bug report, just as you suggested... but they are on their toes over there. Gotta love it!
Cory Posted September 29, 2009

Quoting stevecasper: "You mention "dyno" macros, and based on my recollection of some of your previous posts, I believe you use that term to refer to sub-macros called by the main macro. Is that correct? I'm curious how you would implement such a macro to perform the function I'm striving for here."

Dyno has nothing to do with sub-macros. What I would do is grab one piece of functionality and repeat it a bunch of times, or for a period of time, and measure its speed. This would include logging the start time, finish time, and number of iterations, and displaying a result. This way I could measure how long each of my constituent commands was taking. The analogy is an engine dynamometer for measuring horsepower. Another trick is to go into the Miscellaneous tab of the macro and choose to have it log every command. Then drag the log into Excel and create a new column where you subtract each timestamp from the one above it to determine how long each command is taking. You can even graph it to see where the peaks or concentrations are.

I still like my idea of distilling your big text file first. I thought about it some more, and this is what I dreamed up. Say for instance I was looking for SSNs. I would find the first space, cut the token out, trim it, and if the length is 11 characters move forward, else skip. I wouldn't get too intensive with the tests. For instance, I might check that the numeric characters are all numbers and that where there should be dashes there are dashes. Heck, or just look for dashes in the right place. That should weed out most, and it doesn't need to be perfect. Then if it passes the test, write it to an array variable. In the end you will have a nice array of all the SSNs, and now when you do your ASCII File Process it's really easy to do an array scan. Now if you really want to get tricky, follow my simple method of sorting that array as you generate it. This way it can be unique and in order.
This means you can have a very well-optimized array scan. If you start at element one, and at element X the value is greater than what you're looking for, then you know you don't have to search any more. Also, if the primary list from the other file is sorted, you can keep a low-water mark as well, meaning you will never rescan the elements below the last value you found either. This will reduce the number of comparisons by an order or two of magnitude.
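The early-exit scan Cory describes, over a sorted array and with an optional low-water mark, looks roughly like this in Python (illustrative only; the real version would be Macro Express repeat loops):

```python
def sorted_scan(sorted_accounts, target, start=0):
    """Scan a sorted account list; stop as soon as an element exceeds
    the target, since no later element can match. Returns (found,
    index_reached); a caller whose own source list is also sorted can
    pass index_reached back in as the low-water mark next time."""
    for i in range(start, len(sorted_accounts)):
        if sorted_accounts[i] == target:
            return True, i
        if sorted_accounts[i] > target:
            return False, i          # gone past: no match possible
    return False, len(sorted_accounts)

accounts = ["111", "222", "333", "555"]
found, idx = sorted_scan(accounts, "333")
miss, stop = sorted_scan(accounts, "444")   # stops early at "555"
```

With both lists sorted, each lookup resumes from the previous stopping point, so the two lists are effectively walked once each instead of compared pairwise.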
stevecasper Posted September 30, 2009 (Author)

Quoting Cory: "I still like my idea of distilling your big text file first. ... In the end you will have a nice array of all the SSNs, and now when you do your ASCII File Process it's really easy to do an array scan."

Cory, thank you, as usual, for your input. I've thought about distilling the text file... but to be honest, some of these text files are over 200 MB in size, which translates to over 100,000 printed pages (according to the Citrix source file that the .txt is downloaded from). The process of distilling the file would take a number of hours, after which I'd still need to make the comparison. All I need the macro to do is verify whether or not an account number appears at any point in the text file. ME3 is able to accomplish this in a fraction of a second (I calculated over 17 checks every second, actually), but MEP takes between 10 and 15 seconds to process just one account.

To give a little more information about what is in the text file: there are bank routing numbers, account numbers, various other numbers (I don't know what most of them pertain to; they could be anything from DL#s to SSNs, as they aren't specifically identified), client names, business names, transaction dates, etc. Each "page" is separated by a multi-line header and multi-line footer, each with specific information regarding the banking transactions contained between them. The header and footer also contain numerous special characters used as separators and line-dividers. It's honestly a huge waste of space (some of the pages contain less information than the header and footer).

Another thought on the distillation idea: in order to distill it, I'd need something to base the process on, but the pages look an awful lot like this:

RTI               RFI             ACCT          CLIENT        OTHER STUFF
001231635413003   900058188846    90059465895   John F. Doe   5500506650004546545901095
121315655132      5956564654595   5226262984    Jane C. Doe   456645459594645954656465464

(Just imagine that those are in columns that make sense...)

I don't really have any firm hook-point to look for. I could make it find the first integer after the "other stuff" section, remove everything between the first and second chunks of spaces, and I would have my account number and client name squared away, but again... this goes on for tens of thousands of pages, with anywhere from 5-30 lines to distill on each page. And I'd still have to do it all within a variable. It's possible that I'm not fully understanding the processes you're recommending (you have way more experience at this than I do), so please forgive me if my density gets a little frustrating.

By way of anecdote: today I built a mock-up of a macro that does an ASCII process from my "Mule" file (as does the macro I'm actually using), but this one also utilizes a Text File Process scanning line by line through the .txt file. This was an attempt to eliminate false positives from my final output text file. I knew it would be ridiculous, so I only had it process 1 account from the Mule. It took 1 minute 16 seconds. I calculated that it would take about 11.5 hours to process the entire 220 MB file I tested it on (548 accounts * 76 seconds = 41,648 seconds = 694.13 minutes = 11.57 hours). This would result in 0 false positives, which is a win. But half a terrestrial day would be used.

The macro I'm currently running as the live 1.1 version processed that file in about 2 minutes and 50 seconds, with 14 matches, all false positives. A human person could then go through those 14 matches and eliminate the false positives in a matter of only a few minutes more. I'm waiting on approval to implement my beta 1.2 version, which looks for account number matches and, if matched, checks for R/T matches.
If the R/T matches, it then checks for a client first-name match (long story why I can't use the last name). If all three occur within the .txt, then the account gets added to the output file. It reduced the 14 matches (via 1.1) to only 4 matches, which were also false positives, verified manually within about 2 minutes. But it all has to be done with ME3 (a very fine program indeed), though I would really rather be able to utilize MEP, since it has so many more aesthetic and logical qualities to offer.
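The beta 1.2 three-stage check (account number, then routing number, then client first name must all appear) can be modeled like this (Python sketch with hypothetical field names; the real version is ME3 script):

```python
def three_stage_filter(candidates, text):
    """An account survives only if its account number (with trailing
    space, as in the macro), routing number, and client first name all
    appear somewhere in the report text. Field names are made up."""
    survivors = []
    for c in candidates:
        if ((c["account"] + " ") in text
                and c["routing"] in text
                and c["first_name"] in text):
            survivors.append(c)
    return survivors

cands = [
    {"account": "5226262984", "routing": "121315655132", "first_name": "Jane"},
    {"account": "5226262984", "routing": "999999999",    "first_name": "Jane"},
]
report = "121315655132 5956564654595 5226262984 Jane C. Doe "
kept = three_stage_filter(cands, report)
```

Note the design tradeoff the thread describes: each extra stage cuts false positives (14 down to 4 in the test) but the three fields are only checked for presence anywhere in the file, not on the same line, so some false positives can still slip through.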
Cory Posted September 30, 2009

I believe I understand the algorithm used in the "If Contains", and although it runs as compiled code as opposed to MEP's script, I still think it would be better to distill and organize the second file first. I believe it would be more efficient in either case, but evidently it's unneeded in ME3, which would be a good question for ISS. But here again, it would be really nice if you could make some demo macros that quantify the difference to show them.

You like mathematical proofs, so consider an analogy. Let's say you're looking for matches of a single character in a file with 100k characters. This would be your B list. The A list contains 700 characters to compare from: your A (ASCII File Process) list. That results in 70 million comparisons. But let's say you know that everything you're looking for is numeric, and only 5% of the characters in the B list are numbers. By removing all non-numbers from the B list, the number of comparisons drops to 700 x 5,000 = 3.5M. I can't help but think a little time weeding out the non-account-numbers will help.

Now, you also mentioned doing other comparisons to vet false positives. If in addition to harvesting the account numbers you also want to vet possible false positives, you could have a ready-made catalog of several array variables to speed this process as well. I.e., don't just pull out account numbers, but instead store all the columns in arrays. Not only do you make the searching faster, but by using the same index number you have instant access to the other fields as well. Only one way to find out if I'm right: let's try it!
stevecasper Posted September 30, 2009 (Author)

Quoting Cory: "Let's say you're looking for matches of a single character in a file with 100k characters. ... By removing all non-numbers the number of comparisons drops to 3.5M."

Ok Cory, because I trust you, and because I think I've tried just about every possible way to weed out the unnecessary information that I can think of, I'm going to come right out and ask how you would do it. I've taken a small page from one of today's reports, altered the account information and the name of the client, and I'm adding a screenshot of it here. The highlighted line contains the only information on this page that I would care to keep; the underlined information is the only information that would be pertinent to my macro.

None of this information is uniform throughout the text document. Some lines will have the same Trace information for several lines, but then that will change suddenly. The macro scans through the text file looking for matches to the RFI Account No column. The way the macro is built, it checks for the existence of the account number anywhere in the document, which results in false positives when an account number is short and that sequence of numbers appears at the end of any of the other number series within the document. (The account number is modified with a trailing space to eliminate matches in the middle of another sequence; no leading space is added, because some accounts may contain leading zeroes in the text document that are missing from the Mule.) This is an abnormally short page.
On most pages, the highlighted section would be 10-30 lines long. So, I'm curious: how would you strip the extra baggage out of this file? I've tried setting Integer variables to positions of text that are the same throughout (such as the INFOPOINT at the top, which doesn't appear to change from page to page, and "TRACE," which appears before each section of accounts), and then deleting everything between them. I put it on a repeat and let it go. It worked great on the 300KB file, but when I attempted it on a 39MB file, I let it run for about 5 minutes before deciding this was not going to be efficient enough, since I can run the ME3 macro without distilling and have my results on that 39MB file in 41 seconds (44 seconds if I run the beta version, which removes a good chunk of false positives for me). EDIT: A note about false positives: they're not a big deal. Of the 548 accounts in my Mule file, I have to search for each and every one through all 5 text reports; however, on average fewer than 20 account numbers match, even on the 100,000-page reports. The idea is to remove as many accounts from the list as possible prior to manual review. 20 accounts is much better than 548. My bosses don't care if there are 20 accounts or 5, as long as there aren't 548. But I'm occasionally very anal about this kind of thing, so I want to eliminate all the false positives I possibly can, but it isn't a priority.
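The false-positive behavior Steve describes is easy to reproduce in miniature (the report line and account numbers below are invented):

```python
# Invented data: a report line with a long trace number and a short account.
report_line = "TRACE 000123456789 991234 JOHN DOE"

acct = "1234"
# Bare substring check: hits inside "000123456789" -- a false positive.
print(acct in report_line)            # True

# Trailing space (Steve's trick): rejects matches in the middle of a
# longer sequence, but "991234 " still matches -- the end-of-sequence
# false positive he mentions survives.
print((acct + " ") in report_line)    # True, via "991234 "

acct2 = "2345"
# These digits appear only mid-sequence, so the trailing space rejects them.
print((acct2 + " ") in report_line)   # False
```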
Cory Posted September 30, 2009 You're going to owe me a beer for using up my lunch hour. There are a lot of "it depends" in here, because I cannot see more data. E.g., I assume trace numbers are always 15 characters with no leading zeros. I will also assume the far right columns are misaligned because of your edits. However, this can all be modified if I knew what the bounding rules for your data were. Now, I haven't tested this, and it is only one of many approaches, but it will leave you with an array for each column that you can do whatever you like with. If all you really need is the account number, though, you should ditch the rest. Also, you should have no false positives this way, because when you scan the array you will look for exact matches. If you were doing it your way using the If Contains, you should look for " %acct No% " instead of "%acct No%", i.e., with spaces. This prevents partial matches. When I output to arrays like this I normally check for duplicates and organize as well, but I left that out for simplicity in this case. Essentially, when I go to put something in an array, I start at element one and compare to each and every element. If I hit an exact match, I bail, as it's a duplicate. If I hit an element greater than the new value, I start another loop that bumps each remaining element up one slot and inserts the new value at X. I.e., it inserts my value the way you would sort a deck of cards. Now, when you do an array scan, don't just compare for 'equal to' but also 'greater than.' If the current element is ever greater than your value, then you know you have gone too far and no match is possible. Saves a ton of comparisons. Also, if your source list is in order, you can set low-water marks as well. Let's say you're halfway through and the last match (or greater-than) was found at index 1234. Well, you can start at 1235 and avoid 1,234 comparisons! Combined, you end up comparing very few elements.
Probably on the order of a few percent. Stevo_Harvest.mex
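A minimal Python sketch of the insert routine Cory describes, assuming numeric account values (in MEP this would be a repeat loop over an array variable; the names here are invented):

```python
def insert_sorted_unique(arr, value):
    """Cory's 'deck of cards' insert: walk the array, bail on a
    duplicate, otherwise insert ahead of the first larger element."""
    for i, existing in enumerate(arr):
        if existing == value:
            return                      # exact match: duplicate, skip it
        if existing > value:
            arr.insert(i, value)        # bump the rest up one slot
            return
    arr.append(value)                   # larger than everything so far

accounts = []
for acct in [500, 200, 900, 200, 700]:  # made-up account numbers
    insert_sorted_unique(accounts, acct)
print(accounts)                         # [200, 500, 700, 900]
```

Keeping the array sorted and duplicate-free as it is built is what makes the later greater-than bail-out and the water marks possible.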
stevecasper Posted September 30, 2009 You're going to owe me a beer for using up my lunch hour. I've looked over the code you sent, and I have to say that I hadn't considered the exact processes you are using. I won't be able to test it till tomorrow at the soonest, but I promise to keep my hopes up. And if we ever meet in person, I promise to buy you a beer. Something cheap, though; I'm on an Eggos-for-breakfast, lunch, and dinner budget.
rberq Posted September 30, 2009 This isn't what you want to hear, but: with ME3 you have a simple macro that does the job in a very respectable length of time. With MEP you are considering complex pre-processing of your multi-megabyte text file, to get around a problem in MEP that ISS will probably resolve in short order. Why not just wait for that fix? If you must pre-process, MEP is probably NOT the appropriate tool with which to do it. It could likely be done quicker and far easier on the system that produces the report in the first place, or in a good text editor that can sort the lines and then delete huge blocks of them at one fell swoop. But since we are trying things (which I admit is great fun), I notice that your original ME3 macro loads the huge text file into a standard ME variable (T11, as I recall). The MEP macro loads it into a custom-named variable. How about you use MEP, but revert to loading it into a standard Txx variable, and see if that makes a difference? (If MEP doesn't have the "standard" variables any more, then ignore my suggestion; I don't have MEP yet, so I'm just guessing.)
Cory Posted October 1, 2009 All good points Bob, especially the bit about MEP really not being the right tool. Often we forget that MEP is intended to be a UI automation program with some good programming capabilities, not a full-blown, optimized, compiled programming language. I've been writing many macros that are completely invisible and realize that it would be better to write a proper program. Hence I'm teaching myself VB.NET. However, I doubt I will ever leave MEP; rather, I'll write programs for data crunching and such, or make add-ins for MEP as others here have. I disagree, however, that ISS will 'fix this' anytime soon. In the beta program I found several issues with speed and variables in certain processes, and I know they worked very hard and improved most of them, but I get the feeling that a lot of the lethargy is due to the new variable types, new features, and improved algorithms. My guess is it's either A) as good as it's going to get, or B) an identifiable problem that will be difficult to fix and test and will take some time to get to a release version. In any case, adding the preprocessing will only speed things up and will not hurt if the process gets quicker. BTW, I think that scanning an array for exact matches will be faster and more accurate than an "If Contains." I have not tested that, but it would be easy to try: the first time Steve does it, he could dump my array to a file and test this easily. I do not believe that naming a variable T will do anything. There are no fundamentally 'old school' variables; they are all the new type. However, it would be easy to test.
stevecasper Posted October 1, 2009 This isn't what you want to hear, but: with ME3 you have a simple macro that does the job in a very respectable length of time. Actually, you're probably not far off on a lot of this. However, the "standard" variables are, as far as I understand them, no different from the custom variables in MEP. You can name your variable "T" or "TexasHoldem" and MEP handles them identically. As far as waiting for ISS to address/fix the issue... yeah, that's probably what will end up happening. However, I'm learning a LOT about other functions in the meantime. I tend to get stuck in a rut thinking, "This situation requires the macro to do this..." Running into a road-block that forces me to find other ways to perform what should be a simple process opens doors for enlightenment. Sure, if the roadblock wasn't there, heading straight down the road would be the most efficient, fastest course, but I would miss the challenge of cutting a new path, building a bridge, etc., which, though less efficient for going down the road, may be the most efficient course for something else I come across. Honestly, I probably will keep using the ME3 version of this macro until the developers (AKA real programmers) finish building the program they're working on that will do what this macro already does, only completely automated. It should only be another week or two; my macro is only intended as a band-aid. Now to see if I can make Cory's macro work for me...
chris Posted October 1, 2009 OK, I'm from ISS and I'm very familiar with how the "If Variable Contains" command works. I'm also in a position where I can fix it. First off, is there some way you can send me a test file (and let me know which of the above macros to use) so I can test this? I think I may know where the slowdown is. Also, how many variables do you currently have defined in your macro? (This can greatly affect the speed.) I'd like to get this resolved as quickly as possible. Thank you for your patience with us.
Cory Posted October 1, 2009 So good to hear from you, Chris. If Steve doesn't have time, I can write one for you. Would you prefer someone create a bugrep for this as well?
chris Posted October 1, 2009 There's already a bugrep about this issue (it's what brought me over here in the first place). However, any macro that I can use to test the speed difference between ME3 and MEPro (as far as this particular command is concerned) would be helpful. I actually have a theory about the slowdown, and if it is related to the "If Variable Contains" command, like Steve suggested, and I can noticeably improve its performance, I can apply that same fix throughout the entire program, which would be a good thing all around.
stevecasper Posted October 1, 2009 OK, I'm from ISS and I'm very familiar with how the "If Variable Contains" command works. Hi Chris, I do have a test file I can post here, but it's on my home PC. The files here at work are full of very sensitive information, but at home I built a macro that systematically generated a 30+ MB text file containing a lot of random number series and letter sequences. A small fraction of those numbers and letter sequences match (by design) some mock-up account numbers, routing numbers, and client names in a .csv file that I also built through a macro. The .csv looks much like the one I'm using here at work, but the .txt file looks nothing like it. However, it does contain exactly the information I need the macro to find via the "If Contains" command. I don't know what the file-size limit is on this board, but I will post the file here if I can, once I get home from work. If not, I'll cut it down until it does fit. Not including the built-in T and N variables, I have defined 11 Text Variables and 2 Integer Variables.
The ME Pro macro to use for the test is: <VARIABLE SET STRING Option="\x00" Destination="%MuleLocation%" Value="G:\\Loss Prevention\\ACH\\Mule Accounts.csv"/> <TEXT BOX DISPLAY Title="Find the Source" Content="{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;}\r\n\\viewkind4\\uc1\\pard\\cf1\\lang1033\\b\\f0\\fs24 Browse to the Source ACH file.\\cf0\\b0\\fs16 \r\n\\par }\r\n" Left="469" Top="50" Width="295" Height="73" Monitor="0" OnTop="TRUE" Keep_Focus="FALSE" Mode="\x02" Delay="0"/> <VARIABLE SET STRING Option="\x08" Destination="%ACHSourceFile[1]%" Value="Browse to the ACH SOURCE .txt File"/> <TEXT BOX CLOSE Header="Find the Source"/> <VARIABLE SET STRING Option="\x00" Destination="%ACHSourceFile[2]%" Value="%ACHSourceFile[1]%"/> <VARIABLE MODIFY STRING Option="\x0F" Destination="%ACHSourceFile[2]%" ToReplace=".txt" ReplaceWith=" " All="TRUE" IgnoreCase="TRUE"/> <VARIABLE SET STRING Option="\x00" Destination="%ACHMatchName%" Value="%ACHSourceFile[2]% Match File.txt"/> <VARIABLE MODIFY STRING Option="\x11" Destination="%ACHMatchFile%" Filename="%ACHMatchName%" CRLF="FALSE"/> <DATE/TIME Format="h:mm:ss AMPM" Flags="\x80" Date="12/30/1899" Day_Offset="0" Month_Offset="0" Year_Offset="0" Hour_Offset="0" Second_Offset="0" Left="Center" Top="Center" Monitor="0" Variable="%Start%" IsDateVar="FALSE" _IGNORE="0x0002"/> <TEXT BOX DISPLAY Title="ACH Checker" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The macro is running.\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par \\pard\\qc\\f1 Items processed so far:\\i \\cf2 %Processed%\\i0\\f0 \r\n\\par \\pard\\cf0 \r\n\\par \\pard\\qc Started: \\b %Start%\r\n\\par }\r\n" Left="469" Top="50" Width="290" Height="128" Monitor="0" OnTop="TRUE" Keep_Focus="FALSE" Mode="\x02" Delay="0"/> <VARIABLE SET STRING 
Option="\x03" Destination="%ACHTextContents%" Filename="%ACHSourceFile[1]%" Strip="FALSE"/> <ASCII FILE BEGIN PROCESS Filename="%MuleLocation%" Format="CSV" Start_Record="2" Process_All="TRUE" Records="0" Variable="%T%" Start_Index="1"/> <VARIABLE MODIFY INTEGER Option="\x07" Destination="%Processed%"/> <VARIABLE SET STRING Option="\x00" Destination="%T[4]%" Value="%T[4]% "/> <IF VARIABLE Variable="%ACHTextContents%" Condition="\x06" Value="%T[4]%" IgnoreCase="FALSE"/> <VARIABLE SET STRING Option="\x00" Destination="%MatchedClientInfo%" Value="%T[3]%, %T[4]%, %T[5]%, %T[6]%"/> <VARIABLE MODIFY STRING Option="\x12" Destination="%MatchedClientInfo%" Filename="%ACHMatchName%" CRLF="TRUE"/> <VARIABLE MODIFY INTEGER Option="\x07" Destination="%MatchIncrement%"/> <END IF/> <UPDATE TEXTBOX Header="ACH Checker" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The macro is running.\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par \\pard\\qc\\f1 Items processed so far:\\i \\cf2 %Processed%\\i0\\f0 \r\n\\par \\pard\\cf0 \r\n\\par \\pard\\qc Started: \\b %Start%\r\n\\par }\r\n"/> <ASCII FILE END PROCESS/> <TEXT BOX CLOSE Header="ACH Checker"/> <DATE/TIME Format="h:mm:ss AMPM" Flags="\x80" Date="12/30/1899" Day_Offset="0" Month_Offset="0" Year_Offset="0" Hour_Offset="0" Second_Offset="0" Left="Center" Top="Center" Monitor="0" Variable="%Finish%" IsDateVar="FALSE" _IGNORE="0x0002"/> <TEXT BOX DISPLAY Title="Finished" Content="{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil\\fcharset0 Tahoma;}}\r\n{\\colortbl;\\red0\\green0\\blue255;\\red0\\green128\\blue0;\\red255\\green0\\blue0;}\r\n\\viewkind4\\uc1\\pard\\qc\\cf1\\lang1033\\b\\f0\\fs24 The Checker has finished.\r\n\\par \\cf0\\b0\\f1\\fs16 Items processed: \\cf2 %Processed%\\cf0\\f0 \r\n\\par Matches found: \\cf3\\b\\fs24 
%MatchIncrement%\\cf0\\b0\\fs16 \r\n\\par \\pard \r\n\\par Started - %Start%\r\n\\par Finished - %Finish%\r\n\\par }\r\n" Left="570" Top="Center" Width="278" Height="200" Monitor="0" OnTop="FALSE" Keep_Focus="TRUE" Mode="\x00" Delay="0"/> The ME 3 code to compare with and possibly use as a control in the test is: <TVAR2:14:01:G:\Loss Prevention\ACH\Mule Accounts.csv><TBOX4:F:6:000469000050000290000083:000:Find the SourceBrowse to the Source ACH file.><TVAR2:10:09:SOURCE File><TBCLOSE:Find the Source><TVAR2:15:01:%T10%><TMVAR2:21:15:01:001:000:.txt ><TVAR2:12:01:%T15% Match File.txt><TMVAR2:17:13:00:000:000:%T12%F><DT:h:mm:ss AMPMT:20:1:><TBOX4:F:6:000469000050000290000106:000:ACH CheckerThe macro is running. Started: %T20%><TVAR2:11:04:%T10%><ADFBEG:F10:001:000002:000000:%T14%><TVAR2:04:01:%T4% ><IFVAR2:4:11:7:T4><TVAR2:07:01:%T3%, %T4%, %T5%, %T6%><TMVAR2:20:07:00:000:000:%T12%T><NMVAR:08:01:0:0000001:0:0000000><ENDIF><ADFEND><TBCLOSE:ACH Checker><DT:h:mm:ss AMPMT:21:1:><TBOX4:T:1:000570Center000278000200:000:FinishedThe Checker has finished. Matches found: %N1% Started - %T20% Finished - %T21%> If this isn't exactly what you were looking for, please let me know! EDIT ACH_Mule_File.txt ACH_Test_File.txt I'll post these as a new reply to you, Chris, also. Please note, the Mule File needs to be renamed to .csv (the forum wouldn't let me upload it as a .csv file).
stevecasper Posted October 1, 2009 You're going to owe me a beer for using up my lunch hour. Ok, I've been messing with your harvest macro all morning, and I really like what I've learned from it. You looked at the problem from a much clearer perspective than I had, and took an approach I had looked at but tossed aside as too complicated... and you made it easy. I did have one problem right from the start, but it was an easy fix: ME doesn't like 15-digit integers, so I tweaked that early command down to 9 digits, and from that point the macro was smooth. But then I ran into a wall bricked out of my own denseness: now that I have all these variables stacked %Counter% deep, what do I do with them? I came up with lots of options, everything from running a comparison on each processed line against the Mule file (quickly discarded as WAY cumbersome), down to simply adding a repeat loop at the end, set up to repeat %Counter% times and create a whole new text file replete with the variables just created. When I was satisfied, I ran it on my 300KB text file. It ran fairly quickly (though not as fast as the If Contains via ME3), and looked beautiful and clean. Encouraged by this, I ran it against a less modest 39MB file. After about 5 minutes, I killed it. I disabled most of the macro, just keeping the first part, searching for the 15-digit (9-digit) integer matches. I built a counter and set it up to give me a text box every 500 lines, then modified that to every 30,000. It took 11 minutes to pop up the 30K mark, and after about 22 minutes, it finally popped up 60,000. I then accidentally hit my abort key to kill another macro I was running, and incidentally shut down this one, too (oops). So this was taking too long to be useful. I don't know how many lines it was going to have, ultimately, but that 39MB file is nothing compared to the 250MB monster I had to check today (it took just over 4 minutes with the If Contains in ME3). Now, there is a better-than-average chance that what I did wasn't anywhere near what you were expecting me to do. Because, let's face it, if we lived 5,000 years ago, I'd be the guy trying to figure out how you were able to make bronze from rock and sunlight. Oh, I got your email; I'll respond to it when I get home this afternoon. I think you are on to something.
Cory Posted October 1, 2009 I really appreciate you looking into this and look forward to a fix. If your suspicions prove true, I wouldn't mind a geek description of what the problem was. I always like opportunities to understand how things work behind the curtain.
Cory Posted October 1, 2009 First question first: after populating the array, I would enter an ASCII File Process and plow through your other file. For each account number I would then enter a loop the length of the populated array. It would have a counter, so I would compare to %account no[%counter%]%. Of course, I would have first set a boolean variable %Found% to false; if I found a hit, I would set %Found% to true and break. Then, if Found is true, output my stuff. Now don't forget, I would also remove duplicates and sort my array as I created it. In that case, not only would I do an 'equals' compare but also a 'greater than.' I.e., if the array value at the current counter index is ever greater than my value, I don't need to look anymore. Additionally, if your first file has the accounts in order, then you can set a low-water mark as well. E.g., if my last time through I ended at 2034, then I can start my next scan at 2035 and avoid 2,000+ comparisons. That, combined with the aforementioned high-water mark, would mean very few comparisons. It sounds like the test I proposed was not a good fit, although I'm surprised that a simple conversion to integer should take so long. There are better tests you could do, but again it depends on the 'rules' of the data and its format. I can think of several. For instance, you could grab the first character and see if it's greater than or equal to zero and less than or equal to 9. Repeat that 15 times, or whatever is sufficient to weed out other lines starting with numbers; most lines would fail after the first character. Looking at your example, there are 24 lines and only 2 start with a number. If all the rest of the rows are this way, you could do a second test on character 5 to see if it's a slash indicating a page number. The result would be that 92% of the time the macro would only need to look at one digit before skipping the line, and 50% of the remainder would be cast out on the second comparison. If the data holds true to form, this would be very efficient. Like I say, you have to look at all the data to find tricks like this. Oh, I just saw one: if you test whether the line is equal to "--------------- --------- ..." (the row of dashes above the desired row), then you could trip a flag so that on the next iteration you do the data grab and skip. That would be SUPER simple and efficient! I wonder if the integer convert is going slow because so many of the lines are not integers; MEP might do several things to try to make this work, which would cost time. Hmmm... if I had more time I might want to test this.
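A sketch of the scan Cory outlines, with the Found flag, the greater-than bail-out, and the low-water mark (function and variable names are invented, and this assumes both lists are sorted ascending):

```python
def scan_sorted(harvested, value, low_water=0):
    """Look for value in a sorted list, starting at low_water.
    Returns (found, new_low_water). Bails as soon as an element
    exceeds value -- past that point no match is possible."""
    for i in range(low_water, len(harvested)):
        if harvested[i] == value:
            return True, i
        if harvested[i] > value:
            return False, i             # gone too far: bail out early
    return False, len(harvested)

harvested = [1001, 1005, 1200, 1450, 2034]   # made-up numbers from a report
low = 0
matches = []
for acct in [1005, 1450, 1500]:              # Mule accounts, also ascending
    found, low = scan_sorted(harvested, acct, low)  # resume at low-water mark
    if found:
        matches.append(acct)
print(matches)                               # [1005, 1450]
```

Because each search resumes where the previous one stopped, the two tricks together mean most array elements are never compared at all.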
rberq Posted October 1, 2009 Report Share Posted October 1, 2009 ... when I go to put something in an array I ... start another loop that bubbles element X and inserts the new value at X ... it inserts my value like you would sorting a deck of cards. Now when you do an array scan ... Now, this is going to be fun for you to write if you haven't already done so. You have described perfectly how to build a sorted array. But to scan the array you could write a binary search. That is, you first take the number of elements in the array and divide by 2 to find the midpoint. Say the sorted array contains 96000 entries, and you want to find out if the number 1234567 is one of them. You first compare 1234567 to position 48000 of the array. If 1234567 is greater than the value in position 48000, you set 48000 as the new low point. Find the midpoint between 48000 and 96000, which is 72000. Compare 1234567 to the value in position 72000. If 1234567 is lower than that value, then you have narrowed your search to the section of the array between 48000 and 72000; if 1234567 is higher than that value, then you have narrowed your search to between 72000 and 96000. Continue until done. In other words, you have a simple loop that, in each pass through the loop, divides the portion of the array to be searched in half. Because you are working a geometric progression, in effect, even a huge sorted array can be searched extremely fast. What is 2 to the 20th power? Over a million, I think. So with a binary search you can scan a sorted array of a million items in about 20 passes through your little-bitty loop. It's almost instantaneous from a computer point of view. Only when you have narrowed the array down to less than, say, 4 or 8 items, do you revert to a sequential scan of those remaining few items. An array of a billion entries could be searched with only an additional 10 passes through the loop, that is, 30 passes in all. An array of a trillion entries, 40 passes total. 
I'll bet you can have the ME code written in an hour.
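rberq's halving search can be sketched like this (the array here is synthetic):

```python
def binary_contains(sorted_arr, value):
    """Binary search: each pass halves the candidate range, so a
    million-entry sorted array needs only about 20 comparisons."""
    lo, hi = 0, len(sorted_arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_arr[mid] == value:
            return True
        if sorted_arr[mid] < value:
            lo = mid + 1                # target must be in the upper half
        else:
            hi = mid                    # target must be in the lower half
    return False

evens = list(range(0, 2_000_000, 2))    # one million sorted even numbers
print(binary_contains(evens, 123_456))  # True
print(binary_contains(evens, 123_457))  # False -- odd, so never present
```

This loop runs all the way down to a range of one element rather than switching to a sequential scan for the last few items, which keeps the code shorter at no real cost.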
stevecasper Posted October 2, 2009 OK, I'm from ISS and I'm very familiar with how the "If Variable Contains" command works. Chris, here are a pair of files you can use to test. The "Mule" file needs to be renamed as a .csv (the forum wouldn't let me upload it as that file type). This file is only 188KB, so the MEP version handles it fast enough... the slowness becomes obvious when the text file gets into MBs. You may just want to do a series of copy/pastes within the .txt to increase the file size. ACH_Mule_File.txt ACH_Test_File.txt
Cory Posted October 2, 2009 I like the way you think, Bob. It's good that you mentioned that, as this is one of those things I often take for granted as obvious. I didn't mention it here because, if the other array is sorted as well, you don't need to: with high- and low-water marks there are few elements left to search. However, I suppose you could apply the binary search method within that range to compound the efficiency if the array were really huge. But I usually subscribe to K.I.S.S., and if the speed is OK, there's no need. I wonder how big the array would need to be before the binary method would be statistically quicker. Hmmmm... Usually when I'm doing things like matching account numbers, the sets are very close. I.e., I have a list of 1000 IDs with holes from 1-2000 and a search set of a few hundred ranging from 1-2000, so it usually only takes a few comparisons before you're done. OT: There are several sort routines very similar to this; have you ever attempted to write one of those? I gave up and went with the shuffle sort, which I know is not the quickest.