Macro Express Forums
lemming

Extract All Links From A Google Search


This macro will:

1. load the source HTML of a Google search results page,

2. extract all URLs from the HTML,

3. filter out Google-related links (e.g. About Google, Advanced Search, cached pages, etc.),

4. and display a list of the URLs it managed to extract.

 

This macro works only with Firefox, and I've set its scope to Google search results pages only. You can easily change this, of course.

 

I stopped using IE months ago, so I'll leave it to the holdouts to modify this script for IE or other browsers. It should be trivial, though.

 

This script has no error checking and may fail if it hits malformed HTML. However, this is usually not a problem for machine-generated HTML like Google's results. You might also need to increase the delays on slower machines.

 

To use:

 

1. Search for anything in Google, using Mozilla Firefox.

2. Wait for the page to finish loading completely.

3. Hit Ctrl-Tab.

4. The extracted URLs should be displayed in under 2 seconds.

 

The URL extraction works by finding pairs of href="http and "> , which mark the start and end of a clickable link. For example:

<a href="http://www.macros.com/">
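For anyone who finds the idea easier to follow outside Macro Express, here is a rough Python sketch of the same pair-matching approach. It is not the macro itself, just an illustration: the function name extract_links and the sample HTML are made up for this example, and the two filter strings are the ones the macro checks for.

```python
def extract_links(html):
    """Scan for href="http ... "> pairs and skip Google-related links."""
    start_marker = 'href="http'
    end_marker = '">'
    urls = []
    pos = 0
    while True:
        start = html.find(start_marker, pos)
        if start == -1:
            break  # no more links
        # The URL begins right after href=" (6 characters)
        url_start = start + len('href="')
        end = html.find(end_marker, url_start)
        if end == -1:
            break  # unterminated link, give up
        url = html[url_start:end]
        pos = end + len(end_marker)
        # Same filters as the macro: Google's own pages and cached copies
        if "google.com" in url or "q=cache" in url:
            continue
        urls.append(url)
    return urls


# Hypothetical sample input, for illustration only
sample = (
    '<a href="http://www.macros.com/">Macros</a>'
    '<a href="http://www.google.com/advanced_search">Advanced</a>'
    '<a href="http://example.org/page">Example</a>'
)
print("\r\n".join(extract_links(sample)))
```

Because the end marker is "> , a link that carries other attributes after the href would pick up that extra text in this sketch, so treat the output as a rough list rather than a clean one.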

 

-Lemming

 

Clear Text Variables: All
Clipboard Empty
// Define CR/LF
Variable Set %T95% to ASCII Char of 13
Variable Set %T96% to ASCII Char of 10
Variable Set String %T95% "%T95%%T96%"
// obtain source code
Text Type: <CONTROL>u
Wait For Window Title: "view-source"
Delay 20 Milliseconds
Text Type: <CONTROL>a
Delay 250 Milliseconds
Clipboard Copy
Delay 100 Milliseconds
Variable Set String %T1% from Clipboard
Window Close: "view-source"
Delay 10 Milliseconds
// Process urls
Variable Set String %T99% "CONTINUE"
Repeat Until %T99% = "STOP"
  Variable Set String %T98% "NORMAL LINK"
  // Look for href="http
  Variable Set Integer %N1% from Position of Text in Variable %T1%
  If Variable %N1% = 0
    // No more links: stop, and skip the extraction steps below
    Variable Set String %T99% "STOP"
  Else
    // Delete everything up to the first http
    Variable Modify Integer: %N1% = %N1% + 5
    Variable Modify String: Delete Part of %T1%
    // Look for ">
    Variable Set Integer %N2% from Position of Text in Variable %T1%
    // calc length of url
    Variable Modify Integer: %N3% = %N2% - 1
    // copy url
    Variable Modify String: Copy Part of %T1% to %T2%
    // Filter out Google links
    If Variable %T2% contains "google.com"
      // Indicate this is a Google link
      Variable Set String %T98% "GOOGLE LINK"
    End If
    If Variable %T2% contains "q=cache"
      // Indicate this is a Google link
      Variable Set String %T98% "GOOGLE LINK"
    End If
    If Variable %T98% = "GOOGLE LINK"
      // don't add to list
    Else
      // append url to T3, with CRLF
      Variable Set String %T3% "%T3%%T2%%T95%"
    End If
  End If
Repeat End
// display results
Text Box Display: URLs obtained from Google

parse_links_v0.2.mex


Sorry, but how do you use this? I need it and am new to programming.


Using a downloader app to dump the raw HTML file to disk works well too. Then one just has to parse the file.
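As a rough illustration of the "parse the file" step, here is a minimal Python sketch that uses the standard library's html.parser instead of string matching. This is a different technique from the macro above, and the names LinkCollector, links_from_html and links_from_file are made up for this example. It assumes the page has already been saved to disk by whatever downloader you prefer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that points to an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("http"):
                self.urls.append(value)


def links_from_html(html):
    """Return all absolute links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def links_from_file(path, encoding="utf-8"):
    """Read a saved HTML file (path is whatever the downloader wrote) and parse it."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return links_from_html(handle.read())
```

Since the parser reads attributes properly, links with extra attributes around the href come out clean. Google-related links could then be filtered from the returned list the same way the macro does.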
