dgehman Posted February 25, 2021

I need to clean up a text file that is an index (like a book index: an alphabetical list of words and their page numbers). I want to strip out the single-letter alpha headings (A, B, C, D, etc.) and the page numbers. Each entry can have one or more page numbers and ranges (e.g., ", 21" after "abs" and ", 10, 15, 17, 19-22" after the word "array" in the example below).

Example:

Quote
A
AAA, 2, 6
abs, 21
Accessors, 1, 5
acos, 2, 7
Algorithm, 6
arcsegs, 15
arg, 19
args, 16
arguments, 23
Aribitrary, 6
Array, 3
array, 10, 15, 17, 19-22
arrays, 10, 19-21
asin, 2, 7
assembly, 14
atan, 2, 7-8
B
BBox, 12-13

The alpha head is always one letter. The individual index lines are comma separated. The ideal result for that example would be:

Quote
AAA
abs
Accessors
acos
Algorithm
arcsegs
arg
args
arguments
Aribitrary
Array
array
arrays
asin
assembly
atan
BBox

To remove the single-letter alpha head: is it possible to search for a single letter [a-z] + CR, then delete that letter + CR? Is there a better way?

To delete the page numbers... and here, I'm stuck -- I need problem-solving approaches and/or any suggestions.
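For readers working outside Macro Express, the "single letter + CR" idea in the question can be sketched with a regular expression. This is only an illustration in Python, not a Macro Express command; the names `HEADING` and `drop_headings` are made up for the example, and it assumes every heading is exactly one letter on its own line followed by a line ending.

```python
import re

# Hypothetical pattern: a line made of exactly one letter, followed by
# a line ending (LF or CRLF). MULTILINE lets ^ match at every line start.
HEADING = re.compile(r"^[A-Za-z]\r?\n", flags=re.MULTILINE)


def drop_headings(text: str) -> str:
    """Delete every single-letter heading line, including its line ending."""
    return HEADING.sub("", text)


sample = "A\nAAA, 2, 6\nabs, 21\nB\nBBox, 12-13\n"
print(drop_headings(sample))
```

A heading on the very last line with no trailing line ending would not be matched by this pattern, so the input is assumed to end with a newline.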
rberq Posted February 25, 2021

To delete the page numbers, find the position of the first comma and delete everything from there to the end of the line. (The first image below should have % signs around the variable names -- my error, sorry.)
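The "find the first comma, delete from there to the end of the line" approach can be sketched as follows. This is a Python illustration of the same idea rather than the Macro Express commands from the post; `strip_page_numbers` is a hypothetical name.

```python
def strip_page_numbers(line: str) -> str:
    """Keep only the text before the first comma; leave comma-free lines alone."""
    comma = line.find(",")  # -1 when the line has no comma
    if comma == -1:
        return line
    return line[:comma]


for entry in ["abs, 21", "array, 10, 15, 17, 19-22", "A"]:
    print(strip_page_numbers(entry))
```

Checking for `-1` matters: slicing with `line[:-1]` would silently chop the last character off a line that has no comma.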
dgehman Posted February 25, 2021 Author

Thanks! I never would have thought of the problem as solvable via variable + deletion. I'm going to have to learn how to search through the help file. "Parse" & "Parsing" didn't bring many hits.
rberq Posted February 25, 2021

1 hour ago, dgehman said:
  I'm going to have to learn how to search through the help file. "Parse" & "Parsing" didn't bring many hits.

Getting familiar with the ME commands is something of a project. You almost have to go through them one by one and look at all the options of each command, and use the Help file in conjunction with that. It's often not obvious how the low-level commands can fit together to do a general function like "Parse".
rberq Posted February 25, 2021

2 hours ago, dgehman said:
  To remove the single-letter alpha head

If non-header lines ALWAYS contain a comma, you could detect a header by
dgehman Posted February 25, 2021 Author

9 minutes ago, rberq said:
  Getting familiar with the ME commands is something of a project. You almost have to go through them one by one and look at all the options of each command, and use the Help file in conjunction with that. It's often not obvious how the low-level commands can fit together to do a general function like "Parse".

Exactly right. Poking through docs used to be fun for me... today, it's more like work. I haven't been active with Macro Express for literally years - I remember coming away with the sense that you could do anything with it.

1 minute ago, rberq said:
  If non-header lines ALWAYS contain a comma, you could detect a header by

Maybe you meant NEVER contains? Luckily, it does have that attribute - or non-attribute, I guess -- no comma, anyway.
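The rule the two posts converge on (entry lines always contain a comma, heading lines never do) can be written as a small test. This is a Python sketch under exactly that assumption; `is_heading` is a hypothetical name, and blank lines are deliberately not counted as headings.

```python
def is_heading(line: str) -> bool:
    """True for a non-blank line with no comma, i.e. an alpha heading
    under the thread's assumption that every entry line has a comma."""
    stripped = line.strip()
    return stripped != "" and "," not in stripped


for line in ["A", "AAA, 2, 6", "", "B"]:
    print(repr(line), is_heading(line))
```

If the index could ever contain an entry with no page numbers, this test would misclassify it, so a stricter check such as "exactly one letter" may be the safer choice for other inputs.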
rberq Posted February 25, 2021

11 minutes ago, dgehman said:
  Poking through docs used to be fun for me... today, it's more like work

Still fun for me, because I don't HAVE to do it.😉 I have a web page I parse every day, to update a spreadsheet. But between the page publisher and Firefox, subtle differences appear every few weeks. At least adjusting to the changes keeps me familiar with the macros ....
terrypin Posted February 25, 2021

Here's my version, which works OK here. It's heavily commented so hopefully easy to follow, but post if anything is unclear.

// These variables will be used to end each line of the edited file
Variable Set to ASCII Char 13 to %CR% // Set CR
Variable Set to ASCII Char 10 to %LF% // Set LF
Variable Set String %CRLF% to "%CR%%LF%" // Set combined CRLF
// Split each line into as many parts as the max expected (up to 99). I've used 10.
Text File Begin Process: C:\Users\terry\Dropbox\Macro Express (Sundry)\Parsing-1.txt
  // Test if line is blank.
  If Variable %tLine% Equals ""
    // Include a blank line and bypass the split.
    Variable Set String %tArray[1]% to ""
    Goto:AfterSplit
  End If
  // If there is no comma in the line AND it is not blank, ignore it and go to the next.
  If Variable %tLine% Does not Contain ","
  AND
  If Variable %tLine% Does not Equal ""
    Continue
  End If
  // Otherwise proceed to parse.
  Split String "%tLine%" on "," into %tArray%, starting at 1
  // But only use the first element, tArray[1]. Ignore the rest.
  // Build the result, starting with an empty new file.
  // Add a new line.
  :AfterSplit
  Variable Modify String %tEditedFile%: Append Text String Variable (%tArray[1]%)
  // Add the EOL characters
  Variable Modify String %tEditedFile%: Append Text String Variable (%CRLF%)
Text File End Process
Variable Modify String: Save %tEditedFile% to "C:\Users\terry\Dropbox\Macro Express (Sundry)\EditedFile.txt"
Beep // End of macro, so edited file should be ready.
Full code:

<COMMENT Value="These variables will be used to end each line of the edited file"/>
<VARIABLE SET TO ASCII CHAR Value="13" Destination="%CR%" _COMMENT="Set CR"/>
<VARIABLE SET TO ASCII CHAR Value="10" Destination="%LF%" _COMMENT="Set LF"/>
<VARIABLE SET STRING Option="\x00" Destination="%CRLF%" Value="%CR%%LF%" NoEmbeddedVars="FALSE" _COMMENT="Set combined CRLF"/>
<COMMENT Value="Split each line into as many parts as the max expected (up to 99). I've used 10."/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Users\\terry\\Dropbox\\Macro Express (Sundry)\\Parsing-1.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%tLine%"/>
<COMMENT Value="Test if line is blank.\r\n"/>
<IF VARIABLE Variable="%tLine%" Condition="\x00" IgnoreCase="FALSE"/>
<COMMENT Value="Include a blank line and bypass the split."/>
<VARIABLE SET STRING Option="\x00" Destination="%tArray[1]%" NoEmbeddedVars="FALSE"/>
<GOTO Name="AfterSplit"/>
<END IF/>
<COMMENT Value="If there is no comma in the line AND it is not blank, ignore it and go to the next.\r\n"/>
<IF VARIABLE Variable="%tLine%" Condition="\x07" Value="," IgnoreCase="FALSE"/>
<AND/>
<IF VARIABLE Variable="%tLine%" Condition="\x01" IgnoreCase="FALSE"/>
<CONTINUE/>
<END IF/>
<COMMENT Value="Otherwise proceed to parse."/>
<SPLIT STRING Source="%tLine%" SplitChar="," Dest="%tArray%" Index="1"/>
<COMMENT Value="But only use the first element, tArray[1]. Ignore the rest."/>
<COMMENT Value="Build the result, starting with an empty new file."/>
<COMMENT Value="Add a new line."/>
<LABEL Name="AfterSplit"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%tEditedFile%" Variable="%tArray[1]%" NoEmbeddedVars="FALSE"/>
<COMMENT Value="Add the EOL characters"/>
<VARIABLE MODIFY STRING Option="\x07" Destination="%tEditedFile%" Variable="%CRLF%" NoEmbeddedVars="FALSE"/>
<TEXT FILE END PROCESS/>
<VARIABLE MODIFY STRING Option="\x11" Destination="%tEditedFile%" Filename="C:\\Users\\terry\\Dropbox\\Macro Express (Sundry)\\EditedFile.txt" Strip="FALSE" NoEmbeddedVars="FALSE"/>
<BEEP _COMMENT="End of macro, so edited file should be ready."/>
dgehman Posted February 25, 2021 Author

Thanks! I'd forgotten how helpful this place could be. Next step for me: parse your parsing -- very much appreciated.
acantor Posted February 26, 2021

This solution reminds me a lot of Terry's. You'll need to create two text files and specify their paths and names in the script: the first file contains the original text that will be processed, and the second file is an empty text file.

Fun project! What's the book about?

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This text file contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
  End If
  Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This file receives the results
Text File End Process

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This text file contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<END IF/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This file receives the results"/>
<TEXT FILE END PROCESS/>
terrypin Posted February 26, 2021

Morning Alan,

I like your neat solution approach. But does it remove the header capitals and add a space between sections? That was the tricky part for me, and I ended up unhappily using the GoTo command to get the job done!

Terry
terrypin Posted February 26, 2021

Back at my PC and I see that the only change needed is trivial, but I'll show the full code for the OP.

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This text file contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
    Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This file receives the results
  Else
    // If line does not contain a comma it must be a header capital.
    // So it needs to be replaced by a blank line.
    Variable Modify String: Append %Blank% to text file, "C:\Tmp\File End.txt"
  End If
Text File End Process

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This text file contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This file receives the results"/>
<ELSE/>
<COMMENT Value="If line does not contain a comma it must be a header capital."/>
<COMMENT Value="So it needs to be replaced by a blank line."/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Blank%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE"/>
<END IF/>
<TEXT FILE END PROCESS/>
acantor Posted February 26, 2021

You're right, Terry! Much better. Here's my version, which is functionally identical to yours:

Variable Set Integer %x% to 1
Text File Begin Process: C:\Tmp\File Start.txt // This is the text file that contains the index you want to process
  If Variable %Line[%x%]% Contains ","
    Variable Set Integer %CommaPosition% to the position of "," in %Line[%x%]%
    Variable Modify Integer: %CommaPosition% = %CommaPosition% - 1
    Variable Modify String: Copy part of text in %Line[%x%]% starting at 1 and %CommaPosition% characters long to %Line[%x%]%
  Else
    Variable Set String %Line[%x%]% to ""
  End If
  Variable Modify String: Append %Line[%x%]% to text file, "C:\Tmp\File End.txt" // This is the file that will receive the results
Text File End Process

<VARIABLE SET INTEGER Option="\x00" Destination="%x%" Value="1"/>
<TEXT FILE BEGIN PROCESS Filename="C:\\Tmp\\File Start.txt" Start_Record="1" Process_All="TRUE" Records="1" Variable="%Line[%x%]%" _COMMENT="This is the text file that contains the index you want to process"/>
<IF VARIABLE Variable="%Line[%x%]%" Condition="\x06" Value="," IgnoreCase="FALSE"/>
<VARIABLE SET INTEGER Option="\x0E" Destination="%CommaPosition%" Text_Variable="%Line[%x%]%" Text="," Ignore_Case="FALSE"/>
<VARIABLE MODIFY INTEGER Option="\x01" Destination="%CommaPosition%" Value1="%CommaPosition%" Value2="1"/>
<VARIABLE MODIFY STRING Option="\x09" Destination="%Line[%x%]%" Variable="%Line[%x%]%" Start="1" Count="%CommaPosition%" NoEmbeddedVars="FALSE"/>
<ELSE/>
<VARIABLE SET STRING Option="\x00" Destination="%Line[%x%]%" NoEmbeddedVars="FALSE"/>
<END IF/>
<VARIABLE MODIFY STRING Option="\x12" Destination="%Line[%x%]%" Filename="C:\\Tmp\\File End.txt" Strip="TRUE" NoEmbeddedVars="FALSE" _COMMENT="This is the file that will receive the results"/>
<TEXT FILE END PROCESS/>
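For comparison, the logic of this final macro (truncate entry lines at the first comma, turn each comma-free heading line into a blank line, write everything to a second file) might look like this outside Macro Express. It is a rough Python sketch, not a translation checked against Macro Express behavior; `clean_index` and `clean_index_file` are hypothetical names, and UTF-8 input is an assumption.

```python
from pathlib import Path


def clean_index(text: str) -> str:
    """Keep the text before the first comma on entry lines; replace
    comma-free lines (the alpha headings) with blank lines."""
    cleaned = []
    for line in text.splitlines():
        if "," in line:
            cleaned.append(line.partition(",")[0])
        else:
            cleaned.append("")
    return "\n".join(cleaned) + "\n"


def clean_index_file(src: Path, dst: Path) -> None:
    """Read the index from src and write the cleaned list to dst."""
    dst.write_text(clean_index(src.read_text(encoding="utf-8")), encoding="utf-8")


print(clean_index("A\nAAA, 2, 6\nabs, 21\nB\nBBox, 12-13\n"))
```

Keeping the text transformation separate from the file handling makes it easy to try the rule on a pasted sample before pointing it at real files.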
dgehman Posted February 26, 2021 Author

Amazing posts - thanks again, acantor and Terrypin.

12 hours ago, acantor said:
  Fun project! What's the book about?

Actually, it's not a book, but one topic from a Web-based help file as created in Help+Manual, one of the better old-school help authoring tools (old school meaning not browser-based & online, but installed as a program on your local computer).

Long story about to follow... sorry...

H+M has an input pane outside of the main window, the main one being where you enter the topic text & graphics. This secondary pane is labeled "keywords." When the help file is published as Web-based HTML output, keywords are merged into an accompanying index and search database that allows the viewer to search on terms in the help file. Putting the single-letter rubrics and the page numbers into the Keywords pane would make for really confusing results when a viewer searched on, say, "array" -- the page numbers would be a distraction, a puzzle.

The particular topic in my original post is a long and highly complex reference topic for a software function catchall (with many functions detailed) that's part of the development language of my company's rule-based design/mechanical engineering software. There are several similarly long topics that I need to index -- that is, extract the keywords. Doing this manually (selecting each word and hitting CTRL+k for each) gets old and can get confusing.

So I bought a copy of PDF Index Generator... publish the long topic as a PDF... run it through PDF Index Generator... output it as a text file. The Macro Express function will be to strip the one-letter rubrics and the page numbers, so I can simply copy the text file and drop it into H+M's Keyword entry pane.

Described this way, it begins to look even to me like madness... but it will still be better than manually scouring through some very long topics to mark keywords.
Besides, I like to watch computers work. (It might be more sane just to record/write a simple Macro Express function to double-click the word where the cursor is placed, then keyboard (Text Type) CTRL+k. But that still won't stop my mind from wandering or losing track and creating duplicates.)
acantor Posted February 26, 2021

1 hour ago, dgehman said:
  It might be more sane just to record/write a simple Macro Express function to double-click the word where the cursor is placed, then keyboard (Text Type) CTRL+k.

Here are a few clues that may make this feasible:

1. In Word, there is no need to select a word before pressing Ctrl + K. By default, the "Insert Hyperlink" command acts on the word that has keyboard focus.

2. But there is a bug in how Word defines the boundaries of words. This problem affects many of Word's built-in commands, including "Insert Hyperlink." The command doesn't recognize a word if the insertion point (aka the cursor) is parked before the first character of the word or after the last character of the word. For example, the word "this" in the lines that follow will not be detected if the cursor is before the first letter or after the last character. I'm indicating insertion points with the "|":

Hello |this is a test. [Fails]
Hello t|his is a test. [Succeeds]
Hello th|is is a test. [Succeeds]
Hello thi|s is a test. [Succeeds]
Hello this| is a test. [Fails]

3. The workaround is to select the word. But here's a trick: if the cursor is anywhere within the word, press "F8" twice to select it.

So your Macro Express script might look something like this:

Text Type (Simulate Keystrokes): <ESC> // Exit "select mode" (if it happens to be on)
Text Type (Simulate Keystrokes): <F8><F8> // Select the current word by activating "select mode" twice
Text Type (Simulate Keystrokes): <CONTROL>k
dgehman Posted February 26, 2021 Author

Ah - would it were Word. If that makes sense. The authoring environment uses its own home-built word processor, where the sequence really is to select the whole word (double-click), then hit CTRL+k. But I'll be able to use your ideas and post when I'm working with Word. I wasn't aware of the cursor-anywhere-in-word shortcut. Thanks!
acantor Posted February 26, 2021

In some text editors, it's possible to automatically select the focused word with this:

Mouse Move: To the Text Cursor Position
Mouse Left Double Click