Macro Express Forums

Need help with macro plan - string parsing



I need to clean up a text file that is an index (like a book index: an alphabetical list of words and their page numbers). I want to strip out the single-letter alpha headings (A, B, C, D, etc.) and the page numbers. Each entry can have one or more page numbers and ranges (e.g., ", 21" after "abs" and ", 10, 15, 17, 19-22" after the word "array" in the example below).

 

Example:

Quote

A
AAA, 2, 6
abs, 21
Accessors, 1, 5
acos, 2, 7
Algorithm, 6
arcsegs, 15
arg, 19
args, 16
arguments, 23
Aribitrary, 6
Array, 3
array, 10, 15, 17, 19-22
arrays, 10, 19-21
asin, 2, 7
assembly, 14
atan, 2, 7-8

 

B
BBox, 12-13

 

The alpha head is always one letter. The individual index lines are comma separated.

 

The ideal result for that example would be:

Quote

AAA
abs
Accessors
acos
Algorithm
arcsegs
arg
args
arguments
Aribitrary
Array
array
arrays
asin
assembly
atan


BBox

 

To remove the single-letter alpha head: Is it possible to search for a single letter [a-z] + CR, then delete that letter + CR?

 

Is there a better way?

 

To delete the page numbers... and here, I'm stuck -- need problem-solving approaches and/or any suggestions.
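
One way to frame both steps is as two pattern rules: a line that is exactly one letter is a heading, and everything from the first comma followed by a digit onward is page numbering. The sketch below is in Python rather than Macro Express, purely to illustrate the logic. The names `HEADING`, `PAGES`, and `clean_index` are made up for this example, and it assumes a keyword never contains a comma followed by a digit.

```python
import re

# Hypothetical names for illustration; not part of Macro Express.
HEADING = re.compile(r"^[A-Za-z]$")   # a line that is exactly one letter
PAGES = re.compile(r",\s*\d.*$")      # first ", <digit>" through end of line


def clean_index(text: str) -> str:
    """Drop single-letter headings and strip page numbers from each entry."""
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            continue  # skip the alpha head entirely
        cleaned.append(PAGES.sub("", stripped))
    return "\n".join(cleaned)


sample = "A\nAAA, 2, 6\narray, 10, 15, 17, 19-22\n\nB\nBBox, 12-13"
print(clean_index(sample))
```

Blank lines pass through unchanged, so the gap between letter groups survives even though the heading itself is dropped.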


1 hour ago, dgehman said:

I'm going to have to learn how to search through the help file. "Parse" & "Parsing" didn't bring many hits.

Getting familiar with the ME commands is something of a project.  You almost have to go through them one by one and look at all the options of each command, and use the Help file in conjunction with that.  It's often not obvious how the low-level commands can fit together to do a general function like "Parse". 


9 minutes ago, rberq said:

Getting familiar with the ME commands is something of a project.  You almost have to go through them one by one and look at all the options of each command, and use the Help file in conjunction with that.  It's often not obvious how the low-level commands can fit together to do a general function like "Parse". 

Exactly right. Poking through docs used to be fun for me... today, it's more like work. I haven't been active with Macro Express for literally years - I remember coming away with the sense that you could do anything with it.

 

1 minute ago, rberq said:

If non-header lines ALWAYS contain a comma, you could detect a header by

 

[Attached image: cmdc.JPG]

Maybe you meant NEVER contains? Luckily, it does have that attribute - or non-attribute, I guess -- no comma, anyway.


11 minutes ago, dgehman said:

Poking through docs used to be fun for me... today, it's more like work

Still fun for me, because I don't HAVE to do it.😉  I have a web page I parse every day, to update a spreadsheet.  But between the page publisher and Firefox, subtle differences  appear every few weeks.  At least adjusting to the changes keeps me familiar with the macros ....


Here's my version, which works OK here. It's heavily commented, so hopefully it's easy to follow, but post if anything is unclear.

 

// These variables will be used to end each line of the edited file
Variable Set to ASCII Char 13 to %CR% // Set CR
Variable Set to ASCII Char 10 to %LF% // Set LF
Variable Set String %CRLF% to "%CR%%LF%" // Set combined CRLF
// Split each line into as many parts as the max expected (up to 99). I've used 10.
Text File Begin Process: C:\Users\terry\Dropbox\Macro Express (Sundry)\Parsing-1.txt
// Test if line is blank.

  If Variable %tLine% Equals ""
  // Include a blank line and bypass the split.
    Variable Set String %tArray[1]% to ""
    Goto:AfterSplit
  End If
  // If there is no comma in the line AND it is not blank, ignore it and go to the next.

  If Variable %tLine% Does not Contain ","
    AND
  If Variable %tLine% Does not Equal ""
    Continue
  End If
  // Otherwise proceed to parse.
  Split String "%tLine%" on "," into %tArray%, starting at 1
  // But only use the first element, tArray[1]. Ignore the rest.
  // Build the result, starting with an empty new file.
  // Add a new line.
  :AfterSplit
  Variable Modify String %tEditedFile%: Append Text String Variable (%tArray[1]%)
  // Add the EOL characters
  Variable Modify String %tEditedFile%: Append Text String Variable (%CRLF%)
Text File End Process
Variable Modify String: Save %tEditedFile% to "C:\Users\terry\Dropbox\Macro Express (Sundry)\EditedFile.txt"
Beep // End of macro, so edited file should be ready.

 

Full code:

 

<COMMENT Value="These variables will be used to end each line of the edited file"/>
<VARIABLE SET TO ASCII CHAR Value="13" Destination="%CR%" _COMMENT="Set CR"/>
<VARIABLE SET TO ASCII CHAR Value="10" Destination="%LF%" _COMMENT="Set LF"/>
<VARIABLE SET STRING Option="\x00" Destination="%CRLF%" Value="%CR%%LF%" NoEmbeddedVars="FALSE" _COMMENT="Set combined CRLF"/>
<COMMENT Value="Split each line into as many parts as the max expected (up to 99). I've used 10."/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Users\\terry\\Dropbox\\Macro Express (Sundry)\\Parsing-1.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%tLine%"/>
<COMMENT Value="Test if line is blank.\r\n"/>
<IF VARIABLE Variable="%tLine%" Condition="\x00" IgnoreCase="FALSE"/>
<COMMENT Value="Include a blank line and bypass the split."/>
<VARIABLE SET STRING Option="\x00" Destination="%tArray[1]%" NoEmbeddedVars="FALSE"/>
<GOTO Name="AfterSplit"/>
<END IF/>
<COMMENT Value="If there is no comma in the line AND it is not blank, ignore it and go to the next.\r\n"/>
<IF VARIABLE Variable="%tLine%" Condition="\x07" Value="," IgnoreCase="FALSE"/>
<AND/>
<IF VARIABLE Variable="%tLine%" Condition="\x01" IgnoreCase="FALSE"/>
<CONTINUE/>
<END IF/>
<COMMENT Value="Otherwise proceed to parse."/>
<SPLIT STRING Source="%tLine%" SplitChar="," Dest="%tArray%" Index="1"/>
<COMMENT Value="But only use the first element, tArray[1]. Ignore the rest."/>
<COMMENT Value="Build the result, starting with an empty new file."/>
<COMMENT Value="Add a new line."/>
<LABEL Name="AfterSplit"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%tEditedFile%" Variable="%tArray[1]%" NoEmbeddedVars="FALSE"/>
<COMMENT Value="Add the EOL characters"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%tEditedFile%" Variable="%CRLF%" NoEmbeddedVars="FALSE"/>
<TEXT FILE END PROCESS/>
<VARIABLE MODIFY STRING Option="\x11" Destination="%tEditedFile%" Filename="C:\\Users\\terry\\Dropbox\\Macro Express (Sundry)\\EditedFile.txt" Strip="FALSE" NoEmbeddedVars="FALSE"/>
<BEEP _COMMENT="End of macro, so edited file should be ready."/>
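
For readers who don't have Macro Express to hand, the per-line rules in the macro above can be restated as a short Python sketch. This is only an illustration of the same three cases (blank line, line with no comma, line with a comma); `first_fields` is a made-up name, and the function works on a list of lines rather than reading the file itself.

```python
def first_fields(lines):
    """Apply the macro's rules: keep blanks, skip headings, keep text before the first comma."""
    kept = []
    for line in lines:
        if line == "":
            kept.append("")                  # blank line: carry it over as-is
        elif "," not in line:
            continue                         # no comma and not blank: a heading, so skip it
        else:
            kept.append(line.split(",")[0])  # only the first element is used
    # Like the macro, end every kept line with CR+LF.
    return "".join(item + "\r\n" for item in kept)


print(first_fields(["A", "AAA, 2, 6", "abs, 21", "", "B", "BBox, 12-13"]))
```

As in the macro, the whole result is built up in memory and would be written out once at the end.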

 


This solution reminds me a lot of Terry's. You'll need to create two text files and specify their paths and names in the script: the first file contains the original text that will be processed, and the second file is an empty text file.

 

Fun project! What's the book about?

 

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This text file contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
  End If
  Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This file receives the results
Text File End Process

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This text file contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<END IF/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This file receives the results"/>
<TEXT FILE END PROCESS/>

 


Back at my PC and I see that the only change needed is trivial, but I'll show the full code for the OP.

 

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This text file contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
    Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This file receives the results
  Else
  // If line does not contain a comma it must be a header capital.
  // So it needs to be replaced by a blank line.
    Variable Modify String: Append %Blank% to text file, "C:\Tmp\File End.txt"
  End If
Text File End Process

 

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This text file contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This file receives the results"/>
<ELSE/>
<COMMENT Value="If line does not contain a comma it must be a header capital."/>
<COMMENT Value="So it needs to be replaced by a blank line."/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Blank%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE"/>
<END IF/>
<TEXT FILE END PROCESS/>

 


You're right, Terry! Much better. Here's my version, which is functionally identical to yours:

 

 

 

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This is the text file that contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
  Else
    Variable Set String %Line[%x%]% to ""
  End If
  Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This is the file that will receive the results
Text File End Process

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This is the text file that contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<ELSE/>
<VARIABLE SET STRING Option="\x00" Destination="%Line[%x%]%" NoEmbeddedVars="FALSE"/>
<END IF/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This is the file that will receive the results"/>
<TEXT FILE END PROCESS/>
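
The comma-position idea carries over to other languages almost unchanged. Below is a rough Python sketch of the same flow, offered only as an illustration: find the first comma, keep what comes before it, write a blank line when there is no comma, and append each result to the output file. The names `keyword_only` and `process_file` are invented for this example. Note that Python's `find` counts from 0, so the "minus 1" step in the macro is not needed here.

```python
from pathlib import Path


def keyword_only(line: str) -> str:
    """Return the text before the first comma, or "" if the line has no comma."""
    pos = line.find(",")   # -1 when there is no comma
    if pos == -1:
        return ""          # heading or blank line becomes a blank line
    return line[:pos]


def process_file(src: Path, dst: Path) -> None:
    """Read src line by line and append each cleaned line to dst."""
    with src.open(encoding="utf-8") as fin, dst.open("a", encoding="utf-8") as fout:
        for line in fin:
            fout.write(keyword_only(line.rstrip("\r\n")) + "\n")


print(keyword_only("array, 10, 15, 17, 19-22"))
```

Because this sketch opens the output in append mode, running it twice against the same output file would add the results a second time, so the output file should be empty before each run.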

 


Amazing posts - thanks again, acantor and Terrypin

 

12 hours ago, acantor said:

Fun project! What's the book about?

 

 

Actually, it's not a book, but one topic from a Web-based help file as created in Help+Manual, one of the better old-school help authoring tools (old school meaning not browser-based and online, but installed as a program on your local computer).

 

Long story about to follow... sorry...

 

H+M has an input pane outside of the main window, the main one being where you enter the topic text & graphics. This secondary pane is labeled "keywords." When the help file is published as Web-based HTML output, keywords are merged into an accompanying index and search database that allows the viewer to search on terms in the help file. Putting the single-letter rubrics and the page numbers into the Keywords pane would make for really confusing results when a viewer searched on, say, "array" -- the page numbers would be a distraction, a puzzle.

 

The particular topic in my original post is a long and highly complex reference topic for a software function catchall (with many functions detailed) that's part of the development language of my company's rule-based design/mechanical engineering software. 

 

There are several similarly long topics that I need to index -- that is, extract the keywords. Doing this manually (selecting each word and hitting CTRL+k for each) gets old and can get confusing. So I bought a copy of PDF Index Generator... publish the long topic as a PDF... run it through PDF Index Generator... output it as a text file.

 

The Macro Express function will be to strip the one-letter rubrics and the page numbers, so I can simply copy the text file and drop it into H+M's Keyword entry pane.

 

Described this way, it begins to look even to me like madness... but it will still be better than manually scouring through some very long topics to mark keywords.

 

Besides, I like to watch computers work.

 

(It might be more sane just to record/write a simple Macro Express function to double-click the word where the cursor is placed, then keyboard (Text Type) CTRL+k. But that still won't stop my mind from wandering or losing track and creating duplicates.)


1 hour ago, dgehman said:

It might be more sane just to record/write a simple Macro Express function to double-click the word where the cursor is placed, then keyboard (Text Type) CTRL+k.

 

Here are a few clues that may make this feasible:

 

1. In Word, there is no need to select a word before pressing Ctrl + K. By default, the "Insert Hyperlink" command acts on the word that has keyboard focus.

 

2. But there is a bug in how Word defines the boundaries of words. This problem affects many of Word's built-in commands, including "Insert Hyperlink." The command doesn't recognize a word if the insertion point (aka the cursor) is parked before the first character of the word or after its last character.

 

For example, the word "this" that follows will not be detected if the cursor is before the first letter or after the last character. I'm indicating insertion points with the "|"

 

Hello |this is a test. [Fails]

Hello t|his is a test. [Succeeds]

Hello th|is is a test. [Succeeds]

Hello thi|s is a test. [Succeeds]

Hello this| is a test. [Fails]
 

3. The workaround is to select the word. But here's a trick. If the cursor is anywhere within the word, press "F8" twice to select it.  So your Macro Express script might look something like this:

 

Text Type (Simulate Keystrokes): <ESC> // Exit "select mode" (if it happens to be on)
Text Type (Simulate Keystrokes): <F8><F8> // Select the current word by activating "select mode" twice
Text Type (Simulate Keystrokes): <CONTROL>k

 

 

 

 

 


Ah - would it were Word. If that makes sense.

 

The authoring environment uses its own home-built word processor, where the sequence really is to select the whole word (double-click), then hit CTRL+k.

 

But I'll be able to use your ideas and post when I'm working with Word. I wasn't aware of the cursor-anywhere-in-word shortcut. Thanks!
