What regex can I use to extract URLs from a Google search?

Question

I'm using Delphi with JclRegEx and want to capture all the result URLs from a Google search. I looked at HackingSearch.com, which has an example regex that looks right, but I cannot get any results when I try it.

I'm using it like this:

Var re:TJclRegEx;
    I:Integer; 
Begin
  re := TJclRegEx.Create;

  With re do try
    Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);  
    If match(memo1.lines.text) then begin
      // Add every capture of the match, not only capture 1
      For I := 0 to captureCount -1 do
        memo2.lines.add(captures[I]);
    end;
  finally
    // Release the object once; a second FreeAndNil(re) here would free it twice
    free;
  end;
end;

Regex is available at hackingsearch.com

I'm using the Delphi Jedi version because every time I install TPerlRegEx, the two conflict...

Answer 1:

Off-topic: you can try the Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

Answer 2:

Below is a relevant section from the Google search results for the term python tuple. (I added line breaks here and there to fit the screen, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug.) Your regex gave no matches for this string.

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&amp;sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m">&nbsp;<span dir="ltr">- 2 visits</span>&nbsp;<span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&amp;sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&amp;cd=2&amp;hl=en&amp;ct=clnk&amp;client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW, I guess one of the many reasons is that there is no <Va> in this result at all. I copied the full HTML source from Firebug and tried to match it against your regex, and got no match at all.
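Two further mismatches between the question's pattern and the sample above can be checked in isolation. The sketch below is in Python rather than Delphi, purely for illustration. The `fragment` string is a shortened copy of the `class="gl"` part of the sample, and the `relaxed` and `alternation` patterns are hypothetical rewrites, not something taken from hackingsearch.com.

```python
import re

# Shortened copy of the sample: other attributes sit between "<a" and "href".
fragment = (
    '<span class="gl">\n'
    '    <a onmousedown="return rwt(this)" \n'
    '      href="http://74.125.153.132/search?q=cache">'
)

# The question's pattern requires `<a href="` directly after `class="gl">`.
strict = re.compile(r'class="?gl"?><a href="(.+?)">')

# Hypothetical relaxation: allow whitespace and other attributes before href.
relaxed = re.compile(r'class="?gl"?>\s*<a\b[^>]*?href="(.+?)"')

print(strict.search(fragment))             # no match object
print(relaxed.search(fragment).group(1))   # the cache URL

# `[li|\/ol]` is a character class: it matches ONE of the characters
# l, i, |, /, o. It is not the alternation "li or /ol".
char_class = re.compile(r'<[li|\/ol]')
alternation = re.compile(r'<(?:li|\/ol)')

print(char_class.match('<lx'))    # matches "<l", probably not what was meant
print(alternation.match('<lx'))   # no match object
print(alternation.match('</ol'))  # matches "</ol"
```

Neither check depends on the regex engine being Python's; bracket expressions and literal adjacency behave the same way in PCRE-style engines.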

Google might change the way it displays results from time to time, and at any given time the markup can vary depending on factors like your login status, web history, etc. The particular regex you came up with might work for you for now, but in the long run it will become difficult to maintain. People suggest using an HTML parser instead of a regex because they know a regex solution won't be stable.
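As a rough illustration of the parser approach, here is a minimal sketch using Python's standard `html.parser` module (Python rather than Delphi, for illustration only). It assumes, based solely on the sample above, that each result link is an `<a>` inside `<h3 class="r">`; the class name `ResultLinkParser` and the shortened `sample` string are made up for this example, and Google's markup may differ.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_result_heading = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and "r" in (attrs.get("class") or "").split():
            # Entering a result heading (assumed marker: class "r").
            self.in_result_heading = True
        elif tag == "a" and self.in_result_heading and attrs.get("href"):
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_result_heading = False


# Shortened version of the sample above.
sample = """<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this)"
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes</a>
  </h3>
  <div class="s">
    <span class="gl">
      <a href="http://74.125.153.132/search?q=cache:example">Cached</a>
    </span>
  </div>
</li>"""

parser = ResultLinkParser()
parser.feed(sample)
print(parser.urls)
```

The parser does not care about attribute order, quoting, or line breaks inside the tag, which are exactly the details that break the regex. The "Cached" link is skipped because it is not inside the heading.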

Answer 3:

If you need to debug regular expressions in any language, take a look at RegexBuddy. It's not free, but it will pay for itself in a day.

Answer 4:

class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

works for now.
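To collect every result rather than only the first, the pattern has to be applied repeatedly across the page. A minimal sketch in Python (not Delphi, for illustration only) follows; the `page` string is hypothetical markup with unquoted class attributes, written to suit the pattern and not captured from Google. Note that `class=r?>` only matches an unquoted attribute, so this pattern would not match the quoted `class="r"` in the Firebug sample from Answer 2.

```python
import re

# Answer 4's pattern, unchanged. In Python "." does not match a newline
# by default, so DOTALL is set to let ".+?" span line breaks.
PATTERN = re.compile(
    r'class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>',
    re.DOTALL,
)

# Hypothetical page fragment with two results (an assumption for this demo).
page = (
    '<li class=g><h3 class=r><a href="http://example.com/a" class=l>Title A</a></h3>\n'
    '<div class=s>Snippet A<br><cite>example.com/a</cite>'
    '<span class=gl>Cached</span></div></li>\n'
    '<li class=g><h3 class=r><a href="http://example.com/b" class=l>Title B</a></h3>\n'
    '<div class=s>Snippet B<br><cite>example.com/b</cite>'
    '<span class=gl>Cached</span></div></li>\n'
)

# finditer yields one match per search result; group 1 is the URL,
# group 2 the link text.
results = [(m.group(1), m.group(2)) for m in PATTERN.finditer(page)]
for url, title in results:
    print(url, title)
```

Iterating over the capture groups of a single match, as the question's loop does, lists the URL, title and snippet of one result; a loop over successive matches is what yields one URL per result.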

Source: https://stackoverflow.com/questions/2122757/what-regex-can-i-use-to-extract-urls-from-a-google-search

Tags

regex

Delphi

html-parsing

jvcl