terrypin Posted September 14, 2017
Has anyone here done any work with HTML/XML/KML files, please? This is part of a KML file from Google Earth:
I've hit a block over extracting the name, either by working directly on the file (after removing its tabs and/or its CRLFs) or by using it with Text Begin Process. One obstacle is that MX Pro has no Regular Expression facility, as far as I know. I've had to resort to having the macro open my text editor and work there. It works, but it is very slow, intrusive and inelegant.
With the entire code from the clipboard in a text variable tFullKML, I want to quickly and silently get the name inside the string shown. At its simplest, if stripped of tabs, unique identification would require finding these two successive lines, where the content inside the name tags is unknown:
<Placemark>
<name>Heathrow</name>
Or, stripped of both tabs and CRLFs, it would be one very long line containing the string:
<Placemark><name>Heathrow</name>
Any ideas would be appreciated, please. I'll report back promptly if I hit on a neat solution.
Terry, East Grinstead, UK
Cory Posted September 14, 2017
I do this sort of thing on a daily basis. When I started scraping in MEP I wrote routines to do this. Normally I would simply do the math: find <name> and </name>, then make the adjustments to extract the sub-string. Sometimes it needed more than that, especially when extracting multiple strings.
You might also consider using FindStr. It's a command line utility that does RegEx. Or, if you're interested, I could start my RegEx utility add-on for MEP.
I write a lot of scrapers and extract this kind of data by the millions, but I do it all in .NET now. In fact I'm working on one now :-)
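The "find the positions and do the math" idea can be sketched outside MEP. The following is a minimal Python illustration, not Cory's actual routine: the function name extract_between and the sample string are assumptions made up for the example.

```python
from typing import Optional


def extract_between(text: str, open_tag: str, close_tag: str) -> Optional[str]:
    """Return the text between the first open_tag and the close_tag that follows it."""
    start = text.find(open_tag)
    if start == -1:
        return None  # opening tag not present
    start += len(open_tag)  # the adjustment: skip past the opening tag itself
    end = text.find(close_tag, start)
    if end == -1:
        return None  # no closing tag after the opening tag
    return text[start:end]


# Hypothetical sample standing in for the clipboard contents (tabs and CRLFs removed).
t_full_kml = "<Document><Placemark><name>Heathrow</name></Placemark></Document>"
print(extract_between(t_full_kml, "<Placemark><name>", "</name>"))
```

The only arithmetic is adding the length of the opening tag to its position, so the slice starts at the first character of the name rather than at the tag.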
terrypin Posted September 14, 2017
Hi Cory. Thanks for the fast response. In fact I had what I hope is an inspiration a few minutes after posting.
Using Text File Process on tFullKML, with each line in tLine, I'll set a flag IF tLine = <Placemark>. I'll then test IF the flag is on AND tLine contains <name> AND </name>. If so, I'll remove the unwanted tags, leaving the result. If that test is negative, I'll turn off the flag and keep looking.
I'll also take a look at FindStr.
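The flag-based, line-by-line plan described above might look roughly like this in Python. It is only a sketch under assumptions: the function name name_after_placemark and the sample text are invented for illustration, and each line is stripped of surrounding whitespace to stand in for removing the tabs.

```python
from typing import Optional


def name_after_placemark(kml_text: str) -> Optional[str]:
    """Return the name on the line immediately following a <Placemark> line."""
    flag = False
    for raw_line in kml_text.splitlines():
        line = raw_line.strip()  # drop tabs and other indentation
        if line == "<Placemark>":
            flag = True  # the next line is a candidate
            continue
        if flag and "<name>" in line and "</name>" in line:
            # Remove the unwanted tags, leaving just the name.
            return line.replace("<name>", "").replace("</name>", "")
        flag = False  # test failed: turn the flag off and keep looking
    return None


# Hypothetical sample; the first <name> is skipped because no <Placemark> line precedes it.
sample = (
    "<Document>\r\n"
    "\t<name>My Places</name>\r\n"
    "\t<Placemark>\r\n"
    "\t\t<name>Heathrow</name>\r\n"
    "\t</Placemark>\r\n"
    "</Document>"
)
print(name_after_placemark(sample))
```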
Cory Posted September 14, 2017
I think finding the position and doing the math would be simpler. But whatever works best for you.
terrypin Posted September 15, 2017
You're right, Cory. My method worked OK and was much faster than the original macro using an external text editor. But your approach, working entirely within the variable, was superior. I identified the position of <Placemark><name>, removed the unwanted preceding content, found </name> and removed the succeeding content, leaving just the name. That was not only faster (typically a couple of seconds per file) but also more intuitive to code.
I expect using .NET would give desirable sub-second speed. But my programming know-how is decades out of date and I'm not up for the learning! However, I will make time to see if a method using FindStr has any edge.
Footnote for any other GE users ending up here: there are some complications that a macro like this needs to allow for. In particular, 'Placemarks' come in various types, covering not only places (with or without links) but also paths (with or without timestamps), overlays, lines from the Ruler, multi-geometry objects, etc. So a search for <Placemark><name> won't work for all types. In fact, that's the only reason I need such a macro. For the great majority, a simple F2 on the target placemark in the My Places pane makes its name immediately available for copying. But I'm hoping that a couple more strings will cover all the stuff I use.
Cory Posted September 15, 2017
.NET is milliseconds. In fact the default format for DataTable in .NET is XML. One command, e.g. dtMyTable.ReadXml("c:\DataFile.kml"). I have loaded huge data tables this way and it's instantaneous. I'm guessing the time it takes to read from disk is the slower part. But as long as it's fast enough, MEP works fine.
When I did mine I was ablative as well, but a little different when there were multiples. I would find the XML/HTML tag location, add the length of the tag, and delete everything up to that point. Then I would find the next left angle bracket and subtract one, and that was the length. So then I'd get the sub-string from the first position to it. Then find the next tag and delete again.
One must do the deletions because MEP's Variable Set Integer > Set to the Position of Text in a Text Variable command doesn't include a parameter for a starting position. If it did, you could start each subsequent search after the position of the previous match. Hmmm. I'm going to request that as a feature...
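To make the difference concrete, here is a hedged Python sketch of both ways of pulling out multiple values: the ablative loop described above, and the variant that a starting-position parameter would allow. Python's str.find accepts such a start argument. The function names and the two-placemark sample (including "Gatwick") are assumptions for illustration only, not Cory's code.

```python
from typing import List


def extract_all_ablative(text: str, open_tag: str) -> List[str]:
    """Delete everything up to each tag, then read up to the next '<'."""
    values = []
    while True:
        pos = text.find(open_tag)
        if pos == -1:
            break
        text = text[pos + len(open_tag):]  # delete up to and including the tag
        end = text.find("<")  # next left angle bracket marks the end of the value
        if end == -1:
            break
        values.append(text[:end])
    return values


def extract_all_with_start(text: str, open_tag: str) -> List[str]:
    """Same result without deletions, by searching from a moving start position."""
    values = []
    start = 0
    while True:
        pos = text.find(open_tag, start)
        if pos == -1:
            break
        begin = pos + len(open_tag)
        end = text.find("<", begin)
        if end == -1:
            break
        values.append(text[begin:end])
        start = end  # resume the search after the previous match
    return values


# Hypothetical sample with two placemarks.
kml = (
    "<Placemark><name>Heathrow</name></Placemark>"
    "<Placemark><name>Gatwick</name></Placemark>"
)
print(extract_all_ablative(kml, "<name>"))
print(extract_all_with_start(kml, "<name>"))
```

The second version leaves the original string untouched, which is the practical benefit of a starting-position parameter.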