Macro Express Forums

Working with KML/HTML files



Anyone here done any work with HTML/XML/KML files please?

This is part of a KML file from Google Earth:

[Image: SearchForPlacemarkName-1.jpg]

I've hit a block over extracting the name either by working directly on the file (after removing its tabs and/or its CRLFs) or using it with Text Begin Process. One obstacle is that MX Pro has no Regular Expression facility as far as I know.

I've had to resort to having the macro open my text editor and work there. It works, but is very slow, intrusive and inelegant.

With the entire code from the clipboard in a text variable tFullKML, I want to quickly and silently get the name inside the string shown.

At its simplest, with the tabs stripped, identifying the name uniquely would mean finding these two successive lines, where the content inside the name tags is unknown:

<Placemark>

<name>Heathrow</name>

Or, stripped of both tabs and CRLFs it would be one very long line containing the string:

<Placemark><name>Heathrow</name>

Any ideas would be appreciated please. I'll report back promptly if I hit on a neat solution.

Terry, East Grinstead, UK


I do this sort of thing on a daily basis. When I started scraping in MEP I wrote routines to do this. Normally I would simply do the math: find the positions of <name> and </name>, adjust for the length of the opening tag, and extract the sub-string between them. Sometimes it needed more than that, especially when extracting multiple strings.

You might also consider using FindStr. It's a command line utility that does RegEx. Or, if you're interested, I could start my RegEx utility add-on for MEP.

I write a lot of scrapers and extract this kind of data by the millions but I do it all in .NET now. In fact I'm working on one now :-)


Hi Cory. Thanks for the fast response. In fact I had what I hope is an inspiration a few minutes after posting. Using Text File Process on tFullKML, with each line in tLine, I'll set a flag IF tLine = <Placemark>. I'll also do another to test IF the flag is on AND tLine contains <name> AND </name>. Then I'll remove the unwanted tags, leaving the result. If the latter test is negative I'll turn off the flag and keep looking.
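For anyone following along outside MEP, the flag idea above can be sketched in Python roughly like this. It is only an illustration of the logic, not the actual macro: the function name and the sample text are made up, and it assumes blank lines can be skipped without turning the flag off.

```python
from typing import Optional


def name_after_placemark(full_kml: str) -> Optional[str]:
    """Return the content of a <name> line that directly follows a <Placemark> line."""
    placemark_seen = False  # the "flag" from the description above
    for raw_line in full_kml.splitlines():
        line = raw_line.strip()  # drops tabs and surrounding whitespace
        if not line:
            continue  # assumption: blank lines do not reset the flag
        if placemark_seen and line.startswith("<name>") and line.endswith("</name>"):
            # Remove the unwanted tags, leaving the name.
            return line[len("<name>"):-len("</name>")]
        # Flag is on only when the current line is exactly <Placemark>.
        placemark_seen = (line == "<Placemark>")
    return None


# Hypothetical sample: the first <name> belongs to the document, not a placemark.
flag_sample = (
    "<Document>\n"
    "\t<name>My Places</name>\n"
    "\t<Placemark>\n"
    "\t\t<name>Heathrow</name>\n"
    "\t</Placemark>\n"
    "</Document>"
)
print(name_after_placemark(flag_sample))
```

The flag is what stops the document-level name from being picked up by mistake.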

I'll also take a look at FindStr.

 


You're right, Cory. My method worked OK and was much faster than the original macro using an external text editor. But your approach, working entirely within the variable, was superior. I found the position of <Placemark><name>, removed the unwanted preceding content, then found </name> and removed everything from there on, leaving just the name. That was not only faster (typically a couple of seconds per file) but also more intuitive to code.
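In case it helps anyone compare, here is a rough Python equivalent of that in-variable approach. It is a sketch of the same steps rather than the macro itself; the function name is hypothetical, and it assumes indentation is tabs only, so that stripping tabs and CRLFs leaves the two tags adjacent.

```python
from typing import Optional


def extract_placemark_name(full_kml: str) -> Optional[str]:
    """Return the text between <Placemark><name> and the next </name>."""
    # Strip tabs and line breaks so the tags sit next to each other.
    flat = full_kml.replace("\t", "").replace("\r", "").replace("\n", "")
    marker = "<Placemark><name>"
    start = flat.find(marker)
    if start == -1:
        return None  # no placemark of this simple form
    # Remove the unwanted preceding content, including the marker itself.
    rest = flat[start + len(marker):]
    end = rest.find("</name>")
    if end == -1:
        return None  # malformed: opening tag without a closing tag
    # Remove the succeeding content, leaving just the name.
    return rest[:end]


# Hypothetical sample with tabs and CRLF line endings.
position_sample = (
    "<Document>\r\n\t<name>My Places</name>\r\n"
    "\t<Placemark>\r\n\t\t<name>Heathrow</name>\r\n\t</Placemark>\r\n</Document>"
)
print(extract_placemark_name(position_sample))
```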

I expect using .NET would give desirable sub-second speed. But my programming know-how is decades out of date and I'm not up for the learning! However, I will make time to see if a method using FindStr has any edge.

Footnote for any other GE users ending up here:
There are some complications for which a macro like this needs to allow. In particular, 'Placemarks' come in various types, covering not only places (with or without links) but also paths (with or without timestamps), overlays, lines from the Ruler, multi-geometry objects, etc. So a search for <Placemark><name> won't work for all types. In fact, that's the only reason I need such a macro. For the great majority, a simple F2 on the target placemark in the My Places pane makes its name immediately available for copying. But I'm hoping that a couple more strings will cover all the stuff I use.
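For readers not tied to MEP, one way around the fixed search string is to let an XML parser find the name. The Python sketch below swaps in a different technique (the standard library's ElementTree parser rather than string searching), and it rests on an assumption not confirmed above: that in every placemark type the name is still a direct child of the Placemark element, even if other tags or attributes get in the way of the literal <Placemark><name> string. The function names and sample data are made up.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def placemark_names(kml_text: str) -> List[str]:
    """Collect the <name> of every Placemark, whatever else it contains."""
    root = ET.fromstring(kml_text)
    names = []
    for elem in root.iter():
        if local_name(elem.tag) != "Placemark":
            continue
        for child in elem:  # direct children only
            if local_name(child.tag) == "name":
                names.append((child.text or "").strip())
                break
    return names


# Hypothetical sample: one point with an id attribute, one path whose
# <name> is not the first child.
types_sample = (
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document><name>My Places</name>'
    '<Placemark id="p1"><name>Heathrow</name>'
    "<Point><coordinates>-0.45,51.47,0</coordinates></Point></Placemark>"
    "<Placemark><description>a path</description><name>Morning walk</name>"
    "<LineString><coordinates>0,0,0 1,1,0</coordinates></LineString></Placemark>"
    "</Document></kml>"
)
print(placemark_names(types_sample))
```

Both sample placemarks would defeat a plain search for <Placemark><name>, yet the parser still reaches their names.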


.NET is milliseconds. In fact the default format for DataTable in .NET is XML. One command, e.g. dtMyTable.ReadXml("c:\DataFile.kml"). I have loaded huge data tables this way and it's instantaneous. I'd guess reading the file from disk takes longer than the parsing itself. But as long as it's fast enough, MEP works fine.

When I did mine I also worked by deletion, but a little differently when there were multiple values. I would find the XML/HTML tag location, add the length of the tag, and delete everything up to that point. Then I'd find the next left angle bracket and subtract one, and that was the length of the value. So then get the sub-string from the first position to it. Then find the next tag and delete again. One must do the deletions because MEP's Variable Set Integer > Set to the Position of Text in a Text Variable command doesn't include a parameter for starting position. If it did, you could start each subsequent search after the position of the previous match. Hmmm. I'm going to request that as a feature...
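To show what a starting-position parameter buys you, here is a small Python sketch. It is not MEP code: Python's str.find happens to accept a start index, so each search resumes after the previous match and nothing has to be deleted. The function name and sample string are made up for illustration.

```python
from typing import List


def extract_all(text: str, open_tag: str = "<name>") -> List[str]:
    """Return every value that sits between open_tag and the next '<'."""
    results = []
    pos = 0  # where the next search begins
    while True:
        start = text.find(open_tag, pos)
        if start == -1:
            break  # no more opening tags
        start += len(open_tag)  # step past the tag itself
        end = text.find("<", start)  # next left angle bracket ends the value
        if end == -1:
            break  # unterminated value, stop
        results.append(text[start:end])
        pos = end  # resume after this value instead of deleting text
    return results


# Hypothetical sample with two placemarks on one long line.
multi_sample = (
    "<Placemark><name>Heathrow</name></Placemark>"
    "<Placemark><name>Gatwick</name></Placemark>"
)
print(extract_all(multi_sample))
```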

