terrypin Posted January 22, 2012

Any Regex experts around please?

As part of a macro I'm writing I have a text file that looks like this:

--- Start paste ---
[blackfordLane.jpg]
File name = BlackfordLane.jpg
Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\
Compression = JPEG, quality: 87, subsampling OFF
Resolution = 96 x 96 DPI
File date/time = 19/01/2012 / 15:01:23
- IPTC -
Object Name - s bridge over the River Thames is not a footbridge but carries pipes.
- COMMENT -
Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton.

[Castle Eaton Church.jpg]
File name = Castle Eaton Church.jpg
Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\
Compression = JPEG, quality: 87, subsampling OFF
Resolution = 72 x 72 DPI
File date/time = 19/01/2012 / 14:03:55
- EXIF -
Make - FUJIFILM
Model - FinePix2600Zoom
Orientation - Top left
XResolution - 72
YResolution - 72
ResolutionUnit - Inch
- COMMENT -
Castle Eaton Church

[CastleEaton-2.jpg]
File name = CastleEaton-2.jpg
Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\
Compression = JPEG, quality: 75
Resolution = 0 x 0 DPI
File date/time = 18/01/2012 / 15:40:05
- COMMENT -
The Red Lion, Castle Eaton
A warm welcoming pub on a cold winter's day, with the River Thames running at the bottom of the garden.
--- End paste ---
etc

This is what I want to get as a result:

BlackfordLane.jpg
Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton.
Castle Eaton Church.jpg
Castle Eaton Church
CastleEaton-2.jpg
The Red Lion, Castle Eaton
A warm welcoming pub on a cold winter's day, with the River Thames running at the bottom of the garden.
etc

My first line of attack is to try for a Regex that will find everything (for example) between the ']' of '[blackfordLane.jpg]' and the '-' of '- COMMENT -'. That would leave only a little tidying up. But so far it's eluded me after a couple of hours. The best I could come up with was the following, to delete all lines from File name... to File date/time (with the Replace box empty):

File name = .*\nDirectory = .*\nCompression = .*\nResolution = .*\nImage dimensions = .*\nPrint size = .*\nColor depth = .*\nNumber of unique colors = .*\nDisk size = .*\nCurrent memory size = .*\nFile date/time = .*\n

But that's only part of the task and seems very inelegant. Any suggestions please? I'm also about to post in a couple of Regex forums.

-- Terry, East Grinstead, UK
Cory Posted January 22, 2012

I am no expert, but this is an easy one compared to the expressions I have been writing lately for my scrapers. In VB.NET I would create a match collection for each of these, then loop through the matches and reassemble them in a string accumulator. I tested these and they all work perfectly.

Filename: I'm finding what's between the square brackets. In RegEx the square brackets must be escaped because they are special characters. The parentheses create a capturing group (Group 1), so you get only the file name and not the square brackets. Remember that capture groups are numbered from 1; group 0 is the whole match.

\[(.*)\]

Comment: In this case you need to turn on 'dot matches newline' and make the quantifier lazy with a question mark.

- COMMENT -\r\n(.*?)\r\n \r\n
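For readers without VB.NET, the "collect the matches, loop, reassemble" idea above might look roughly like this in Python. This is only a sketch: the sample text is a shortened, hypothetical stand-in for the pasted listing, the CRLF line endings and the single space on the blank line are assumptions taken from this thread, and the helper names `extract` and `reassemble` are made up for illustration.

```python
import re

# Hypothetical, shortened sample. CRLF endings and a one-space "blank" line
# are assumptions; the real file may differ.
SAMPLE = (
    "[Castle Eaton Church.jpg]\r\n"
    "File name = Castle Eaton Church.jpg\r\n"
    "Resolution = 72 x 72 DPI\r\n"
    "- COMMENT -\r\n"
    "Castle Eaton Church\r\n"
    " \r\n"
    "[blackfordLane.jpg]\r\n"
    "File name = BlackfordLane.jpg\r\n"
    "- COMMENT -\r\n"
    "Thames Path on Blackford Lane\r\n"
    " \r\n"
)

# Group 1 holds the text between the escaped square brackets.
NAME_RE = re.compile(r"\[(.*)\]")
# re.DOTALL is Python's "dot matches newline"; the ? makes .* lazy.
COMMENT_RE = re.compile(r"- COMMENT -\r\n(.*?)\r\n \r\n", re.DOTALL)


def extract(text):
    """Return (file name, comment) pairs found in text."""
    names = NAME_RE.findall(text)
    comments = COMMENT_RE.findall(text)
    # Pairing by position assumes every entry has exactly one comment block.
    return list(zip(names, comments))


def reassemble(pairs):
    """Join the pairs into name/comment lines, one pair after another."""
    return "\n".join(f"{name}\n{comment}" for name, comment in pairs)


if __name__ == "__main__":
    print(reassemble(extract(SAMPLE)))
```

Note that the bracketed label is what gets captured, so the name keeps the label's spelling (`blackfordLane.jpg`) rather than the one on the `File name =` line.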
terrypin Posted January 22, 2012

Thanks, I'll study that and experiment. BTW, what are 'scrapers'?

Edit: More important, what is that \r? TextPad's Regex (the POSIX variety, apparently) doesn't seem to include that option. Could you spell out what that code is specifying please, and I'll see if there's an equivalent to '\r' I can use.

Edit 2: I'm guessing that \r is a CR? In which case I don't see why I need it, even if TextPad supported it. Isn't \n (Return) sufficient? Anyway, I eventually tried

- COMMENT -\n(.*?)\n

which is OK up to a point. But it doesn't find the second line of comment if there is one, such as for CastleEaton-2.jpg.

-- Terry, East Grinstead, UK
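The two snags raised above (whether \r is needed, and the lost second comment line) can both be sidestepped in an engine that supports lookahead. The Python sketch below is one possible way, not TextPad syntax: `\r?` makes the carriage return optional, and the lazy capture runs until the next blank-looking line or the end of the text instead of stopping at the first line break. The sample text and the function name `comments` are hypothetical.

```python
import re

# Hypothetical sample with LF-only line endings and a two-line comment.
SAMPLE = (
    "[CastleEaton-2.jpg]\n"
    "File name = CastleEaton-2.jpg\n"
    "- COMMENT -\n"
    "The Red Lion, Castle Eaton\n"
    "A warm welcoming pub on a cold winter's day.\n"
    "\n"
    "[Next.jpg]\n"
    "- COMMENT -\n"
    "One line only\n"
)

COMMENT_RE = re.compile(
    r"- COMMENT -\r?\n"          # marker line, CR optional
    r"(.*?)"                     # lazy capture, may span several lines
    r"(?=\r?\n[ \t]*\r?\n"       # stop before a blank (or spaces-only) line
    r"|\s*\Z)",                  # or before trailing whitespace at the very end
    re.DOTALL,
)


def comments(text):
    """Return every comment block, with CRLF normalised to LF."""
    return [c.replace("\r\n", "\n") for c in COMMENT_RE.findall(text)]


if __name__ == "__main__":
    for block in comments(SAMPLE):
        print(block)
        print("--")
```

Because the line ending is written as `\r?\n`, the same pattern should cope with both Windows (CRLF) and Unix (LF) files.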
Cory Posted January 23, 2012

Scrapers are applications that collect and organize data from some bigger and often unorganized data source. In the old days with terminal emulators we would 'screen-scrape' mainframes that didn't have a proper reporting system, e.g. go through every account and collect the past-due balances of every active account. But these days most scrapers are web bots that collect data from web pages. For instance, the one I've been most actively evolving retrieves tax record data from a multitude of county websites in different formats and outputs one data file with the results. The test sets I run are usually only a few thousand records, but I know my client ran one recently with 650k records. That is way beyond the scope of MEP. Besides running 1 to 2 orders of magnitude faster, it never bombs out or has any timing issues. And since I develop in VB.NET I can give my client an executable to run and not rely on having MEP installed.

Check out http://www.regular-expressions.info/ and RegEx Buddy. RegEx Buddy will help you build an expression and even has a wizard to generate the code for you in a variety of languages. I don't use the wizards so I can't say how well they work. But the expression builder/tester makes it so much easier to see what's what.

\r is a Carriage Return (CR). \n is a Line Feed (LF); the N is for Newline. Unix and other internet-based systems often use just the Line Feed, whereas Windows uses CRLF. The main problem here is that you really should have attached a file instead of pasting it into the message, so that I could see the invisible characters; I don't know whether the forum software is doing replacements. Or at least wrap it in code tags. Alternatively you could use the opening square bracket and trim. But in the example I gave you, the hex editor revealed that the last blank line contained one space. So to detect the first blank line I am trapping on EoL > space > EoL, e.g. "\r\n \r\n", and the capturing parentheses are placed so that these characters are not returned.

Are you actually using TextPad to do the extraction? My 2¢: Don't. I would write it as a VBScript. You can even embed that into an MEP macro if you like and run it as an External Script. At least I believe VBScript has support for RegEx.
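The point about invisible characters can be checked without a hex editor. As a small Python sketch (the function names are hypothetical), `repr()` spells out \r, \n and stray spaces, and a tolerant separator pattern treats a line holding only spaces or tabs as blank, whichever line-ending style the file uses:

```python
import re


def show_invisibles(line):
    """Return the line with \\r, \\n and trailing spaces made visible."""
    return repr(line)


def split_entries(text):
    """Split text into entries at blank lines.

    A "blank" line may still contain spaces or tabs, and each line may end
    in CRLF or plain LF.
    """
    parts = re.split(r"\r?\n[ \t]*\r?\n", text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    print(show_invisibles("- COMMENT -\r\n"))
    print(split_entries("first\r\nentry\r\n \r\nsecond entry\r\n"))
```

Splitting on the blank line first and then searching inside each entry is one alternative to matching "\r\n \r\n" literally, which breaks as soon as the space or the CR is missing.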
terrypin Posted January 23, 2012

Thanks for that helpful follow-up. But, as I've mentioned a few times, I'm not a programmer. So VBScript is not in my repertoire of tools! I have Regex Coach but I'll certainly check out RegEx Buddy. I have also just installed Gawk, on recommendation elsewhere. It looks powerful (and did solve the immediate requirement, albeit working in copy/paste mode!). But again it would require major effort to learn.

-- Terry, East Grinstead, UK
Cory Posted January 30, 2012

If you like, I can create some RegEx for you. It is difficult to understand at first, but if you have someone else create the code it's easy to see how to modify it to your needs without having to understand how it all works. For instance, I could write a VBScript, driven by a few simple MEP variables, that returns all the email addresses on a web page or something like that. Getting the match collection from a string with a RegEx is only 2-3 lines of code. Super simple. You don't need a supporting program, and it all happens invisibly from within your macro.
terrypin Posted January 31, 2012

Thanks Cory, that's generous of you, but I'm going to try to manage this sort of stuff myself if possible. I've been using RegEx for some years now and I'm OK with it for most purposes I encounter. I suppose I use it in TextPad a couple of times a week on average, with bursts of activity for some projects.

The challenge arises with tasks like the one that prompted this thread. And it's now clear after research in the TextPad forum that its implementation in that otherwise excellent editor falls well short of the more powerful repertoire of Perl etc. In particular, TextPad's RegEx can't easily find/replace text that spans multiple lines. I've now supplemented it with a sister program, WildEdit, that doesn't suffer that limitation.

So I now have several approaches to problems like this:

1. Continue to use TextPad, with which I'm so familiar, and complement it with a macro to handle the whole file.
2. Use WildEdit alone.
3. Use AutoHotkey, a scripting language in which I'm dipping my toes, and which has strong RegEx as far as I can gather. It also benefits from a very active and helpful forum.
4. Oh, and post here for help from you!

One prerequisite of course is a decent grasp of RegEx itself. A root snag is that my skill level waxes and wanes. It was pretty good a few years ago, after some intensive study motivated by some particular project or curiosity. But then months elapse when I have no need for anything more than simple stuff, and I forget 90% of it again!

-- Terry, East Grinstead, UK
paul Posted January 31, 2012

> 3. Use AutoHotkey, a scripting language in which I'm dipping my toes, and which has strong RegEx as far as I can gather. It also benefits from a very active and helpful forum.

AutoHotkey is a very useful utility, and I suspect there are some things (mainly keyboard and hotkey orientated) that cannot be done so easily in other languages like AutoIt (which shares some common ancestry with AutoHotkey). But as far as language constructs and syntax are concerned, AutoHotkey really is truly horrible. It's totally non-standard; one example is text strings that include embedded spaces, which, in most cases, don't require (and must not use) surrounding quotation marks.

I'm certain that AutoIt is at least as powerful as AutoHotkey in its handling of regular expressions, and I firmly believe that the AutoIt Help documentation is superior to that of AHK, and that the AutoIt forums are supported in a more professional way.
terrypin Posted February 1, 2012

Thanks Paul, I'll take another look at AutoIt. Maybe it's not typical, but on each of several attempts this morning its forum at http://www.autoitscript.com/forum/ has taken 30-60 seconds to appear.

Edit: An hour on and the connection now seems much faster.

With apologies for getting further OT, a couple of queries:

1. I have a dozen or so AHK scripts in regular use. Presumably there will be no 'conflict' if I run those as well as AutoIt scripts?
2. But ideally I'd like to settle on just ONE of those tools. Are there any aids to help convert AHK to AutoIt?
3. Forum seems just as active as AHK's and many resources available. Of the scores of 'tutorials', can you recommend one or two to get me started please?

-- Terry, East Grinstead, UK
paul Posted February 1, 2012

> With apologies for getting further OT, a couple of queries: 1. I have a dozen or so AHK scripts in regular use. Presumably there will be no 'conflict' if I run those as well as AutoIt scripts?

No conflicts occur.

> 2. But ideally I'd like to settle on just ONE of those tools. Are there any aids to help convert AHK to AutoIt?

Not that I know of. But I am willing, able and available (for a fee) to help you with this, or to do the conversions for you if you don't have the time.

> 3. Forum seems just as active as AHK's and many resources available. Of the scores of 'tutorials', can you recommend one or two to get me started please?

Wiki recommended Regular Expressions Function documentation Regular Expressions Forums search for StringRegExp
terrypin Posted February 1, 2012

Thanks Paul, very helpful.

-- Terry, UK
lemming Posted October 1, 2012

Hi Terry, I just came across this topic. Hopefully you already have a solution. Anyway, I played around with AHK and came up with a regex one-liner to glean the data you wanted.

alltext := RegExReplace( alltext, "s)\[([^\]]+)\.jpg\][^\[]*- COMMENT -.{0,2}\n" , "$1.jpg`n" )

This regex captures the filename between the square brackets, then discards everything between .jpg and - COMMENT - (inclusive). Whatever is left is the result you wanted.

I've attached a zip with two files: an .ahk file (with comments) and a text file containing the data you posted. Both files need to be in the same folder.

Terry data.zip
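For anyone without AutoHotkey, a rough Python rendering of the one-liner above follows. It assumes that Python's `re.DOTALL` behaves like AHK's `s)` option for this pattern, and it runs on a shortened, hypothetical sample rather than the attached data file; the function name `strip_metadata` is made up for illustration.

```python
import re

# Hypothetical, shortened sample with LF line endings.
SAMPLE = (
    "[blackfordLane.jpg]\n"
    "File name = BlackfordLane.jpg\n"
    "Resolution = 96 x 96 DPI\n"
    "- IPTC -\n"
    "- COMMENT -\n"
    "Thames Path on Blackford Lane\n"
    "\n"
    "[CastleEaton-2.jpg]\n"
    "File name = CastleEaton-2.jpg\n"
    "- COMMENT -\n"
    "The Red Lion, Castle Eaton\n"
    "A warm welcoming pub.\n"
)

# Same pattern as the AHK call: capture the name inside the brackets, then
# swallow everything up to and including the "- COMMENT -" line.
# [^\[]* cannot cross into the next "[...]" header; .{0,2} absorbs an
# optional stray character (such as a CR) before the line feed.
PATTERN = re.compile(
    r"\[([^\]]+)\.jpg\][^\[]*- COMMENT -.{0,2}\n",
    re.DOTALL,
)


def strip_metadata(text):
    """Replace each header-plus-metadata run with just '<name>.jpg'."""
    return PATTERN.sub(r"\g<1>.jpg\n", text)


if __name__ == "__main__":
    print(strip_metadata(SAMPLE))
```

Because the comment text itself is never matched, multi-line comments survive untouched; the blank separator lines between entries also remain and would need a final tidy-up if they are unwanted.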
terrypin Posted October 1, 2012

Thanks lemming, I'll try that soon, although I did get the original problem sorted a while ago. Your regex skills are a tad ahead of mine! (See my comments about this topic to Cory 9 months ago.)

-- Terry, UK
acantor Posted October 1, 2012

When I was messing with Regedit a few years ago, I was able to find excellent answers to most of my questions through one of the on-line forums. (I think it was an AutoHotkey subforum.)