arekowczarek Posted January 30, 2011

Lately I have been doing a lot of text file processing, and since I have an even bigger project of that type ahead of me, I was trying to find a way to speed it up. Basically I need to analyze several thousand .html files [ranging from 50 KB to 200 KB] using the Text File Process command and extract specific values to ONE .INI file. I thought the best way to speed things up would be to create a RAM disk and put both the .html files and the .ini file there. As a matter of fact it worked: I got about a 130% speed boost. But later I got curious about which part of the macro was taking the longest, and the inconsistency of the short tests I ran led me to do some deeper research. Here are the results:

FIRST TEST - READ SPEED:

<VARIABLE SET STRING Option="\x00" Destination="%T[1]%" NoEmbeddedVars="FALSE"/>
<REPEAT START Start="1" Step="1" Count="50000" Save="FALSE"/>
<VARIABLE SET STRING Option="\x03" Destination="%T[1]%" Filename="D:\\1 KB.txt" Strip="FALSE"/>
<END REPEAT/>
<BEEP/>

Using this macro I tested its runtime for text files of the following sizes [KB]: 1, 10, 50, 90, 100, 110, 200, 1000, 10000. In the first stage the files were located on a HDD, in the second on a RAM disk. Each test was run three times for extra accuracy.

The biggest surprise: not only are the read speeds similar for the HDD and the RAM disk, but for files from 1 KB to 1000 KB, files are actually read FASTER from the HDD (!) than from the RAM disk. For files of 1000 KB and bigger there is no difference between HDD and RAM disk.

The next interesting thing that can be observed: for both HDD and RAM disk, top performance is achieved while working with 100 KB files. Not 90 KB, not 110 KB, exactly 100 KB. How come this is the "favourite" file size?

SECOND TEST - WRITE SPEED:

<VARIABLE SET STRING Option="\x03" Destination="%T[1]%" Filename="D:\\1 KB.txt" Strip="FALSE" _BACK="00FFFFFF"/>
<REPEAT START Start="1" Step="1" Count="50000" Save="FALSE" _BACK="00FFFFFF"/>
<VARIABLE MODIFY STRING Option="\x11" Destination="%T[1]%" Filename="D:\\dummy.txt" CRLF="FALSE" _BACK="00FFFFFF"/>
<END REPEAT _BACK="00FFFFFF"/>
<BEEP _BACK="00FFFFFF"/>

Using this macro I tested its runtime for text files of the following sizes [KB]: 1, 10, 100, 1000, 10000. In the first stage the files were located on a HDD, in the second on a RAM disk. Each test was run three times for extra accuracy.

This looks the way it should: files are written to the RAM disk 142% - 162% faster than to the HDD (the bigger the file, the bigger the difference between HDD and RAM disk write performance).

Platform used for the tests:
WIN XP PRO 32
ME PRO 4.2.2.1
StarWind RAM disk 5.5 [emulating a 100 MB RAM disk]
WD 500 GB 7200 RPM HDD
Athlon x2 2x2.1 GHz
2x1GB DDR2 800 MHz RAM sticks

I tried to run the tests with a different RAM disk but failed to find another freeware one (likely I didn't search well enough). If you can recommend one, please do, and I'll be happy to re-run the tests. The StarWind RAM disk received the most credit throughout the forums and it's free, therefore it was the one used for the tests. What is missing here is an SSD test, but I don't have access to one at the moment.

The general conclusions I drew from the tests:
1. If the files that are going to be processed (read from the disk) are smaller than 1000 KB, it is better to keep them on a HDD rather than on a RAM disk.
2. For files of any size that are going to be written to very frequently by a macro, it is far better to store them on a RAM disk.
I'd be glad to see some comments on the test results. If anyone can shed some light on why it is faster to read files from the HDD than from the RAM disk, and why 100 KB files are the fastest ones to read, please do.
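For reference, here is a rough Python equivalent of the read-test loop above, in case someone wants to reproduce it outside Macro Express. It is only a sketch: the path is a placeholder and it obviously doesn't include the macro interpreter's own overhead, so it is not the macro that produced the numbers above.

import time

path = r"D:\1 KB.txt"  # placeholder test file, like in the macro above

start = time.perf_counter()
for _ in range(50000):
    with open(path, encoding="utf-8", errors="replace") as f:
        data = f.read()  # same idea as setting a text variable from a file
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f} s for 50,000 reads")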
Cory Posted January 30, 2011

At one time I did some testing similar to this (in fact we had a post about it on the forums here someplace), but my results were that there was no benefit to using the RAM disk. But here's the thing: there's so much going on with the caching, from both the HDD itself and the OS, that it's hard to tell exactly what's responsible. I will tell you that Vista, and even more so W7, had much better performance here than XP.

But I think you need to ask yourself what the point is of moving this data to a RAM disk. It's to get it into fast RAM, right? So why not just load it into RAM instead, i.e. read the files into variables? Generally what I do is set the file contents to a variable and then do my thing: sometimes in arrays, sometimes finding markers in the file and massaging the string variables. And then when I'm done, I write my results back to a file in one hit. It seems to me that for what you are doing you have to read this data and then save your results at least once in any case, so no RAM disk is going to help you there. And if everything else in between is done in variables, you're working in the fast world of RAM. In other words, I've never seen a benefit of using a RAM disk when I already have direct access to RAM.

But I really do like your experimentation. It's so good to put things to the test and see what works. Also, my server has an SSD for its OS drive and I'd be happy to run any tests you like on it. I've found a lot of the performance hype about SSDs a little overblown, but they are way cooler than conventional platter drives.
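Not Macro Express syntax, but a rough Python sketch of the pattern Cory describes (the file name and the marker string are placeholders): read the whole file into a variable in one disk hit, then do all the searching in memory.

# one disk read: the whole file ends up in a string variable
with open(r"D:\page.html", encoding="utf-8", errors="replace") as f:  # placeholder file name
    text = f.read()

# find a marker and take the rest of that line, entirely in memory
marker = "some text to find"  # placeholder search string
pos = text.find(marker)
if pos != -1:
    end = text.find("\n", pos)  # end of that line (read() normalizes line endings to \n)
    rest_of_line = text[pos + len(marker):end if end != -1 else len(text)]
    print(rest_of_line)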
arekowczarek Posted February 1, 2011

> But I think you need to ask yourself what the point is of moving this data to a RAM disk. It's to get it into fast RAM, right? So why not just load it into RAM instead, i.e. read the files into variables? Generally what I do is set the file contents to a variable and then do my thing.

This is what I do all the time, except it's not going to work very efficiently for my project, unless I'm doing something wrong here. Let's say I need to find this string: <p class="shortDescription" located in a .html file. Then I need to copy the rest of the line to another file. But here's the catch: the string in question appears several times in the file. I roughly know in which part of the file it is, so I only need to check lines 4000-7000 (I know the occurrence I am looking for is the only one within this part of the file). It is easy to check the specified lines using Text File Process. Now let's try to accomplish that after copying the whole file content into a variable. How would you separate lines 4000-7000 from within the variable and perform the string search only within that area? The only idea I could think of to locate where line 4000 starts was something like this:

Repeat Start 4000 times
  Variable Modify String - Replace "CRLF" with " " [only one]
End Repeat
Variable Set Integer %N[1]% from the position of "CRLF"

Now %N[1]% indicates where the hot content starts. I didn't even go deeper into this idea since it had already failed the efficiency test vs the Text File Process method.

> sometimes in arrays, sometimes finding markers in the file and massaging the string variables.

I don't think I follow the arrays part.

> And then when I'm done, I write my results back to a file in one hit. It seems to me that for what you are doing you have to read this data and then save your results at least once in any case, so no RAM disk is going to help you there.

One-hit saving is a no go. All the data retrieved has to be saved into an .INI file: one section and 18 entries per HTML file. Saving every .ini entry has to be performed with a separate command, so the file has to be written to (modified) 18 times (well, not really 18, because some of the entries are populated by another macro, but still: 1 command per 1 entry saved). Tests proved that if the .ini file is located on the RAM disk I gain a huge advantage versus storing the .ini on the HDD. And by huge I mean from here o--------------------------------------------> to here. That huge.

> And if everything else in between is done in variables, you're working in the fast world of RAM. In other words, I've never seen a benefit of using a RAM disk when I already have direct access to RAM.

That is true/obvious: the RAM disk cannot possibly outrun the RAM itself. But as I pointed out above, not everything can be done within a variable's content. Of course this statement will cease to apply if it turns out the "array" way you mentioned above can be applied as a solution. But I need you to clear that up for me.

> But I really do like your experimentation. It's so good to put things to the test and see what works. Also, my server has an SSD for its OS drive and I'd be happy to run any tests you like on it. I've found a lot of the performance hype about SSDs a little overblown, but they are way cooler than conventional platter drives.

I sure know they are. I was using one for a while but had to sell it (my money printer got broken, but as soon as it's fixed I'm getting myself one).
As for the tests, I think it only makes sense to compare HDD vs RAM disk vs SSD performance when all three tests are performed on a single machine. We have to keep in mind that macro performance is greatly influenced by CPU speed. Even a macro like this:

Repeat Start 50000 times
  Delay 1 ms
End Repeat

will produce very different runtime results on different machines: 62 seconds on my current machine [Athlon 2x2.1 GHz]. (At least 50 of those seconds are the delays themselves, so the rest is loop and command overhead, and that is the part that depends on the CPU.) I appreciate that you volunteered for the tests. If you are interested (or anyone else, of course) in running the whole set of tests on one machine (HDD + RAM disk + SSD), I wrote a macro that basically only requires setting the file directories and it's good to run. Depending on the accuracy level you choose, both read and write tests will take from 0.5 to 1.5 hours. The test results are stored in a .txt file which I'd be happy to process and dress up into charts like the ones above.

The file attached is actually a .rar archive; I had to change its extension to be allowed to attach it.

The ancient archive of endless joy and happines.txt
Cory Posted February 8, 2011

I was speaking in generalities, so of course there may be times when one needs to use files. But I'll follow the lead on the specifics in your case. Here's how I would limit my search to 4k-7k:

Set %CR% to 0x0D (carriage return)
Set %LF% to 0x0A (line feed)
Start the repeat with folder (guess)
  Increment counter %C%
  Suck the file contents into %Temp%
  Split %Temp% into %Line% on "%CR%%LF%" (now I have an array with each element one line)
  Repeat 2000 times starting at 4000 and keeping counter %C2%
    If %Line[%C2%]% contains "yada yada" (test I need)
      Set %Output Line% to "%C%,%C2%,some useful information" (create a CSV line for my output file)
      Set %Output[%C%]% to %Output Line% (store my results in an array instead of INI)
      Break
    End If
  End Repeat
End Repeat
Join %Output% from 1 to %C% on "%CR%%LF%" into %Temp%
Save %Temp% to my results CSV file

I don't know why you're doing INI files, but I like to save mine in TSV (CSV in this example) files. The point being: save the output in an array and then at the end generate one results file, instead of saving it thousands of times to thousands of files. I just wanted to give you a general idea of how I approach this and avoid disk writes.
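If it helps, here is a rough Python rendering of the same split-and-scan idea. It is only a sketch, not the actual macro: the folder, the marker string, and the output path are placeholders, and the scanned window follows the lines 4000-7000 mentioned earlier in the thread.

import glob

MARKER = "yada yada"  # placeholder for the string being searched for
output_rows = []      # results collected in memory, like the %Output[]% array

for path in glob.glob(r"D:\pages\*.html"):  # placeholder folder of .html files
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()       # split once into an array, one element per line
    # scan only lines 4000-7000 (1-based in the posts, 0-based indexing here)
    for idx in range(3999, min(7000, len(lines))):
        if MARKER in lines[idx]:
            output_rows.append(f"{path},{idx + 1},{lines[idx]}")  # one CSV row per hit
            break                            # only one occurrence expected in that window

with open(r"D:\results.csv", "w", encoding="utf-8") as out:
    out.write("\n".join(output_rows))        # a single write at the end instead of many INI updates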
arekowczarek Posted February 9, 2011

> I was speaking in generalities, so of course there may be times when one needs to use files. But I'll follow the lead on the specifics in your case. Here's how I would limit my search to 4k-7k. [...] The point being: save the output in an array and then at the end generate one results file, instead of saving it thousands of times to thousands of files. I just wanted to give you a general idea of how I approach this and avoid disk writes.

I have to admit that upon seeing your idea of separating "the lines of content" I thought it was going to be light speed. However, I ran some tests (on a 10k-line .html file) and it is still about 20 times (!) faster to use the Text File Process command than the splitting-on-CR/LF macro. Really odd. The most time-consuming part of your macro was checking the %Line% array for the "yada yada" occurrence.

The reason .ini files are used is that this database was never meant to grow to this extent. I'm going to convert it to CSV asap or sooner.

I wonder if anyone can answer some of the questions asked in the opening post? Mainly the one about 100 KB files being the fastest to read. It's bugging me, I can't sleep, I can't eat, I can't... well, ok, I'm just curious.

Thanks for the input, Cory.
Cory Posted February 10, 2011

> I have to admit that upon seeing your idea of separating "the lines of content" I thought it was going to be light speed. However, I ran some tests (on a 10k-line .html file) and it is still about 20 times (!) faster to use the Text File Process command than the splitting-on-CR/LF macro. Really odd.

You might make sure you're running 4.2.2.1. They have made some recent improvements to performance with arrays.
arekowczarek Posted February 10, 2011

> You might make sure you're running 4.2.2.1. They have made some recent improvements to performance with arrays.

4.2.2.1 it is. I decided to attach the two macros I used for testing so that you can check for yourself, and while cleaning up the code I stumbled upon one unnecessary command that was affecting the "CR/LF" macro (I had added logging of each line to a txt file to check whether it was working alright and forgot to remove it afterwards). Dumb me, I admit it (but don't quote me, I'll deny it).

Anyway, I re-ran the tests, and although the Text File Process method's advantage is not as big as before, it is still crushing: 5 times faster than the "CR/LF" macro. I attached the macro I used for testing and the html file that was processed.

TEST Text file process vs caching.mex
test.html