Macro Express Forums

Text editing - possibly with Regex



Any Regex experts around please?

 

As part of a macro I'm writing I have a text file that looks like this:

 

--- Start paste ---

[blackfordLane.jpg]

File name = BlackfordLane.jpg

Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\

Compression = JPEG, quality: 87, subsampling OFF

Resolution = 96 x 96 DPI

File date/time = 19/01/2012 / 15:01:23

 

- IPTC -

Object Name - s bridge over the River Thames is not a footbridge but carries pipes.

 

- COMMENT -

Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton.

 

[Castle Eaton Church.jpg]

File name = Castle Eaton Church.jpg

Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\

Compression = JPEG, quality: 87, subsampling OFF

Resolution = 72 x 72 DPI

File date/time = 19/01/2012 / 14:03:55

 

- EXIF -

Make - FUJIFILM

Model - FinePix2600Zoom

Orientation - Top left

XResolution - 72

YResolution - 72

ResolutionUnit - Inch

 

- COMMENT -

Castle Eaton Church

 

[CastleEaton-2.jpg]

File name = CastleEaton-2.jpg

Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\

Compression = JPEG, quality: 75

Resolution = 0 x 0 DPI

File date/time = 18/01/2012 / 15:40:05

 

- COMMENT -

The Red Lion, Castle Eaton

A warm welcoming pub on a cold winter's day, with the River Thames running at the bottom of the garden.

--- End paste ---

 

etc

 

This is what I want to get as a result:

 

BlackfordLane.jpg

Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton.

 

Castle Eaton Church.jpg

Castle Eaton Church

 

CastleEaton-2.jpg

The Red Lion, Castle Eaton

A warm welcoming pub on a cold winter's day, with the River Thames running at the bottom of the garden.

 

etc

 

My first line of attack is to try for a Regex that will find everything (for example) between the ']' of '[blackfordLane.jpg]' and the '-' of '- COMMENT -'. That would leave only a little tidying up.

 

But so far it's eluded me after a couple of hours. The best I could come up with was the following to delete all lines from File name... to File date/time (with the Replace box empty):

 

File name = .*\nDirectory = .*\nCompression = .*\nResolution = .*\nImage dimensions = .*\nPrint size = .*\nColor depth = .*\nNumber of unique colors = .*\nDisk size = .*\nCurrent memory size = .*\nFile date/time = .*\n

 

But that's only part of the task and seems very inelegant.
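As a hedged aside, the chained pattern above only matches when all eleven fields are present and in that order, and the sample as posted shows only five of them. A minimal Python sketch of a looser alternative is to delete each metadata line on its own with an alternation. The field names are taken from the pattern above; `SAMPLE`, `METADATA_LINE` and `strip_metadata` are illustrative names, and the single-spaced LF layout of the file is an assumption.

```python
import re

# Assumed layout: one field per line, LF line endings.
SAMPLE = (
    "[CastleEaton-2.jpg]\n"
    "File name = CastleEaton-2.jpg\n"
    "Compression = JPEG, quality: 75\n"
    "Resolution = 0 x 0 DPI\n"
    "File date/time = 18/01/2012 / 15:40:05\n"
    " \n"
    "- COMMENT -\n"
    "The Red Lion, Castle Eaton\n"
)

# Each alternative is one of the field names from the chained pattern.
# re.MULTILINE lets ^ match at the start of every line.
METADATA_LINE = re.compile(
    r"^(?:File name|Directory|Compression|Resolution|Image dimensions|"
    r"Print size|Color depth|Number of unique colors|Disk size|"
    r"Current memory size|File date/time) = .*\n",
    re.MULTILINE,
)


def strip_metadata(text):
    """Remove every 'Field = value' line, whichever fields are present."""
    return METADATA_LINE.sub("", text)


print(strip_metadata(SAMPLE))
```

This still leaves the bracketed header and the '- COMMENT -' marker to tidy up, so it covers only the same part of the task as the pattern above.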

 

Any suggestions please?

 

I'm also about to post in a couple of Regex forums.

 

--

Terry, East Grinstead, UK

Link to comment
Share on other sites

I am no expert but this is an easy one compared to the expressions I have been writing lately for my scrapers.

 

In VB.NET I would create a match collection for these then loop thru each and reassemble into a string variable accumulator. I tested these and they all work perfectly.

 

Filename:

I’m finding what’s between the square brackets. But in RegEx one must escape the square brackets because square brackets are special characters. Parentheses create a backreference to Group 1 so you will only get the file name and not the square brackets. Remember that group numbering is not zero based in RegEx.

\[(.*)\]

 

Comment:

In this case you need to use the ‘dot matches newline’ option and make the match lazy with a question mark.

- COMMENT -\r\n(.*?)\r\n \r\n
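The match-collection idea can be sketched in Python rather than VB.NET; this is an illustration under assumptions, not the code that was tested above. The two patterns are the ones given in this post, `re.DOTALL` plays the role of 'dot matches newline', and `SAMPLE`, `FILENAME_RE`, `COMMENT_RE` and `extract` are made-up names. The sample assumes CRLF line endings and a blank line holding one space after every comment.

```python
import re

# Assumed layout: CRLF line endings, blank lines contain a single space.
SAMPLE = (
    "[blackfordLane.jpg]\r\n"
    "File name = BlackfordLane.jpg\r\n"
    "Resolution = 96 x 96 DPI\r\n"
    " \r\n"
    "- COMMENT -\r\n"
    "Thames Path on Blackford Lane.\r\n"
    " \r\n"
    "[Castle Eaton Church.jpg]\r\n"
    "File name = Castle Eaton Church.jpg\r\n"
    " \r\n"
    "- COMMENT -\r\n"
    "Castle Eaton Church\r\n"
    " \r\n"
)

# Group 1 is the text between the escaped square brackets.
FILENAME_RE = re.compile(r"\[(.*)\]")
# re.DOTALL makes the dot match newlines; .*? keeps the capture lazy.
COMMENT_RE = re.compile(r"- COMMENT -\r\n(.*?)\r\n \r\n", re.DOTALL)


def extract(text):
    """Pair each file name with its comment and rebuild one string."""
    names = FILENAME_RE.findall(text)        # findall returns group 1 only
    comments = COMMENT_RE.findall(text)
    # Assumption: every record has exactly one comment, in the same order.
    parts = [name + "\r\n" + comment + "\r\n"
             for name, comment in zip(names, comments)]
    return "\r\n".join(parts)


print(extract(SAMPLE))
```

Two caveats apply to this sketch. The comment pattern needs the space-only blank line after each comment, so a final record without one would be missed. And the bracketed name keeps the header's spelling ('blackfordLane.jpg'), which differs in case from the 'File name =' line in the posted data.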

Link to comment
Share on other sites


Thanks, I'll study that and experiment.

 

BTW, what are 'scrapers'?

 

Edit: More important, what is that \r? TextPad's Regex (the POSIX variety apparently) doesn't seem to include that option. Could you spell out what that code is specifying please, and I'll see if there's an equivalent to '\r' I can use.

 

Edit 2: I'm guessing that \r is a CR? In which case I don't see why I need it, even if TextPad supported it? Isn't \n (Return) sufficient? Anyway, I eventually tried

- COMMENT -\n(.*?)\n

which is OK up to a point. But it doesn't find the second line of comment if there is one, such as for CastleEaton-2.jpg
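The first-line-only behaviour can be reproduced in Python, as a hedged sketch rather than anything TextPad itself supports. A lazy capture followed by '\n' stops at the first line break; one possible fix is to end the capture with a lookahead for a blank (or space-only) line, or the end of the text. `SAMPLE`, `FIRST_LINE_ONLY` and `WHOLE_COMMENT` are illustrative names, '[NextPhoto.jpg]' is a hypothetical following record, and LF line endings are assumed.

```python
import re

# Assumed layout: LF line endings; "[NextPhoto.jpg]" is a made-up next record.
SAMPLE = (
    "- COMMENT -\n"
    "The Red Lion, Castle Eaton\n"
    "A warm welcoming pub on a cold winter's day.\n"
    " \n"
    "[NextPhoto.jpg]\n"
)

# Lazy capture followed by \n: stops at the first line break it meets.
FIRST_LINE_ONLY = re.compile(r"- COMMENT -\n(.*?)\n", re.DOTALL)

# Lazy capture that runs until a blank or whitespace-only line,
# or until the end of the text for the last record.
WHOLE_COMMENT = re.compile(r"- COMMENT -\n(.*?)(?=\n[ \t]*\n|\n?\Z)", re.DOTALL)

first = FIRST_LINE_ONLY.search(SAMPLE).group(1)
whole = WHOLE_COMMENT.search(SAMPLE).group(1)

print(repr(first))
print(repr(whole))
```

The sketch assumes a comment never contains an empty line of its own; if it can, the next '[' header would be a safer end marker.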

 

--

Terry, East Grinstead, UK

Link to comment
Share on other sites

Scrapers are applications that in some way collect and organize data from some bigger and often unorganized data source. In the old days with terminal emulators we would 'screen-scrape' mainframes that didn't have a proper reporting system, e.g. go through every account and collect the past-due balances of every active account. But these days most scrapers are web bots that collect the data from web pages. For instance, the one I've been most actively evolving retrieves tax record data from a multitude of county websites in different formats and outputs one data file with the results. The test sets I run are usually only a few thousand records, but I know my client ran one recently with 650k records. Way beyond the scope of MEP. Besides the fact that it runs 1 to 2 orders of magnitude faster, it never bombs out or has any timing issues. And since I develop in VB.NET I can give my client an executable to run and not rely on having MEP installed.

 

Check out http://www.regular-expressions.info/ and RegEx Buddy. RegEx buddy will help you build an expression and even has a wizard to generate the code for you in a variety of languages. I don't use the wizards so I can't say how well they work. But the expression builder/tester makes it so much easier to see what's what.

 

\r is a Carriage Return.

\n is Line Feed. N is for Newline

 

Unix and other internet-based systems often use just the Line Feed, whereas Windows uses CRLF.
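A small Python sketch of two common ways to cope with both conventions: make the carriage return optional in the pattern ('\r?\n'), or convert CRLF to LF before matching. The names `WINDOWS_TEXT`, `UNIX_TEXT`, `EOL_AGNOSTIC` and `normalize` are illustrative only.

```python
import re

WINDOWS_TEXT = "- COMMENT -\r\nCastle Eaton Church\r\n"   # CRLF line endings
UNIX_TEXT = "- COMMENT -\nCastle Eaton Church\n"          # LF line endings

# \r?\n accepts either ending; [^\r\n]* keeps a stray \r out of the capture.
EOL_AGNOSTIC = re.compile(r"- COMMENT -\r?\n([^\r\n]*)")


def normalize(text):
    """Convert Windows CRLF line endings to plain LF."""
    return text.replace("\r\n", "\n")


print(EOL_AGNOSTIC.search(WINDOWS_TEXT).group(1))
print(EOL_AGNOSTIC.search(UNIX_TEXT).group(1))
```

The capture uses '[^\r\n]*' rather than '.*' because in Python the dot matches '\r', which would otherwise end up at the tail of the captured text on Windows files.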

 

The main problem here is that you really should have attached a file instead of pasting it in the message, so I could see the invisible characters. And I don't know if the forum software is doing replacements. Or at least encapsulate it with code tags. Alternatively you could use the opening square bracket and trim. But in the example I gave you the hex editor revealed that the last blank line contained one space. So to detect the first blank line I am trapping on EoL > space > EoL, e.g. "\r\n \r\n", and the capturing parentheses are placed so that these characters are not returned.
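Without a hex editor, the invisible characters can also be inspected from a script. This is a minimal Python sketch, with `show_invisibles` as a made-up helper name: it prints each line through `repr()`, so carriage returns, line feeds and lone spaces become visible.

```python
def show_invisibles(text):
    """Return each line as a repr() string so \\r, \\n and spaces are visible."""
    # keepends=True keeps the line terminator attached to each line.
    return [repr(line) for line in text.splitlines(keepends=True)]


# Assumed fragment: a comment, a blank line holding one space, the next header.
fragment = "Castle Eaton Church\r\n \r\n[CastleEaton-2.jpg]"
for shown in show_invisibles(fragment):
    print(shown)
```

On a fragment like the one assumed here, the middle line shows up as a space followed by '\r\n', which is the "EoL > space > EoL" sequence described above.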

 

Are you actually using TextPad to do the extraction? My 2¢: Don't. I would write it as a VBScript. You can even embed that into an MEP macro if you like and run as External Script. At least I believe VBScript has support for RegEx.

Link to comment
Share on other sites

Thanks for that helpful follow-up.

 

But, as I've mentioned a few times, I'm not a programmer. So VBScript is not in my repertoire of tools!

 

I have Regex Coach but I'll certainly check out RegEx Buddy.

 

I have also just installed Gawk, on recommendation elsewhere. It looks powerful (and did solve the immediate requirement, albeit working in copy/paste mode!). But again it would require major effort to learn.

 

--

Terry, East Grinstead, UK

Link to comment
Share on other sites

If you like I can create some RegEx for you. It is difficult to understand at first but if you have someone create the code it's easy to see how you can modify it to your needs without having to understand how it all works. For instance I could write a VBScript that you could set simple MEP variables for that will return all the email addresses on a web page or something like that. And doing a RegEx that gets the match collection from a string is only 2-3 lines of code. Super simple. And you don't need a supporting program and it happens all from within your macro invisibly.

Link to comment
Share on other sites

Thanks Cory, that's generous of you, but I'm going to try to manage this sort of stuff myself if possible. I've been using RegEx for some years now and I'm OK with it for most purposes I encounter. I suppose I use it in TextPad a couple of times a week on average, with bursts of activity for some projects. The challenge arises with tasks like the one that prompted this thread. And it's now clear after research in the TextPad forum that its implementation in that otherwise excellent editor falls well short of the more powerful repertoire of Perl etc. In particular TextPad's RegEx can't easily find/replace multi-line returns.

 

I've now supplemented it with a sister program, WildEdit, that doesn't suffer that limitation. So I now have several approaches to problems like this:

1. Continue to use TextPad, with which I'm so familiar, and complement it with a macro to handle the whole file.

2. Use WildEdit alone.

3. Use AutoHotkey, a scripting language in which I'm dipping my toes, and which has strong RegEx as far as I can gather. It also benefits from a very active and helpful forum.

 

Oh, and

4. Post here for help from you!

 

One pre-requisite of course is a decent grasp of RegEx itself. A root snag is that my skill level waxes and wanes. It was pretty good a few years ago, after some intensive study motivated by some particular project or curiosity. But then months elapse when I have no need for anything more than simple stuff, and I forget 90% of it again!

 

 

--

Terry, East Grinstead, UK

Link to comment
Share on other sites

3. Use AutoHotkey, a scripting language in which I'm dipping my toes, and which has strong RegEx as far as I can gather. It also benefits from a very active and helpful forum.

AutoHotkey is a very useful utility, and I suspect there are some things (mainly keyboard and hotkey orientated) that cannot be done so easily in other languages like AutoIt (which shares some common ancestry with AutoHotkey). But as far as language constructs and syntax are concerned, AutoHotkey really is truly horrible. It's totally non-standard; one example is text strings that include embedded spaces, which, in most cases, don't require (and must not use) surrounding quotation marks.

I'm certain that AutoIt is at least as powerful as AutoHotkey in its handling of regular expressions, and I firmly believe that the AutoIt Help documentation is superior to that in AHK, and that the AutoIt forums are supported in a more professional way.

Link to comment
Share on other sites

Thanks Paul, I'll take another look at AutoIt.

 

Maybe it's not typical, but on each of several attempts this morning its forum at http://www.autoitscript.com/forum/ has taken 30-60 seconds to appear.

 

Edit:

An hour on and connection now seems much faster.

 

With apologies for getting further OT, a couple of queries:

1. I have a dozen or so AHK scripts in regular use. Presumably there will be no 'conflict' if I run those as well as AutoIt scripts?

2. But ideally I'd like to settle on just ONE of those tools. Are there any aids to help convert AHK to AutoIt?

3. Forum seems just as active as AHK's and many resources available. Of the scores of 'tutorials', can you recommend one or two to get me started please?

 

--

Terry, East Grinstead, UK

Link to comment
Share on other sites

With apologies for getting further OT, a couple of queries:

1. I have a dozen or so AHK scripts in regular use. Presumably there will be no 'conflict' if I run those as well as AutoIt scripts?

No conflicts occur.

2. But ideally I'd like to settle on just ONE of those tools. Are there any aids to help convert AHK to AutoIt?

Not that I know of. But I am willing, able and available (for a fee) to help you with this, or to do the conversions for you if you don't have the time.

3. Forum seems just as active as AHK's and many resources available. Of the scores of 'tutorials', can you recommend one or two to get me started please?

Wiki recommended

Regular Expressions

Function documentation

Regular Expressions

Forums search for StringRegExp

Link to comment
Share on other sites

  • 7 months later...

Hi Terry, I just came across this topic. Hopefully you already have a solution.

 

Anyway, I played around with AHK and came up with a regex one-liner to glean the data you wanted.

 

alltext := RegExReplace( alltext, "s)\[([^\]]+)\.jpg\][^\[]*- COMMENT -.{0,2}\n" , "$1.jpg`n" )

 

This regex captures the filename in between the square brackets, then it discards everything between .jpg and - COMMENT - (inclusive). Whatever is left is the result you wanted. I've attached a zip with two files, an .ahk file (with comments) and a text file containing the data you posted. Both files need to be in the same folder.
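For readers without AHK, the same substitution can be sketched in Python; this is an approximate translation under assumptions, not the contents of the attached .ahk file. AHK's 's)' option corresponds to `re.DOTALL`, '$1' to '\1', and '`n' to a newline. `SAMPLE`, `PATTERN` and `condense` are illustrative names, and the sample assumes LF line endings and no '[' inside the metadata.

```python
import re

# Assumed layout: LF line endings, no '[' anywhere except record headers.
SAMPLE = (
    "[blackfordLane.jpg]\n"
    "File name = BlackfordLane.jpg\n"
    " \n"
    "- IPTC -\n"
    "Object Name - x\n"
    " \n"
    "- COMMENT -\n"
    "Thames Path on Blackford Lane.\n"
    " \n"
    "[CastleEaton-2.jpg]\n"
    "File name = CastleEaton-2.jpg\n"
    " \n"
    "- COMMENT -\n"
    "The Red Lion, Castle Eaton\n"
    "A warm welcoming pub.\n"
)

# Same pattern as the AHK one-liner; re.DOTALL stands in for the "s)" option.
PATTERN = re.compile(r"\[([^\]]+)\.jpg\][^\[]*- COMMENT -.{0,2}\n", re.DOTALL)


def condense(text):
    """Replace each header-through-COMMENT-marker run with just the file name."""
    return PATTERN.sub(lambda m: m.group(1) + ".jpg\n", text)


print(condense(SAMPLE))
```

In this sketch the space-only line between records survives, which gives the blank separator in the wanted output. Multi-line comments are kept whole because nothing after the '- COMMENT -' marker is touched.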

Terry data.zip
