Extract loosely structured Wikipedia text (HTML)

Question

Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous: the links that point to specific people named Corzine are difficult to capture with jsoup because they are not explicitly structured, nor do they live in a particular section as in this example. See the Corzine page here.

How can I get a hold of them? Is jsoup a suitable tool for this task?

Perhaps I should use regex, but I fear doing that because I want it to be generalizable.

</b> may refer to:</p> 
 <ul> 
  <li><a href

^ This part is standard; maybe I could use a regex to match it?

<p><b>Corzine</b> may refer to:</p> 
 <ul> 
  <li><a href="/wiki/Dave_Corzine" title="Dave Corzine">Dave Corzine</a> (born 1956), basketball player</li> 
  <li><a href="/wiki/Jon_Corzine" title="Jon Corzine">Jon Corzine</a> (born 1947), former CEO of <a href="/wiki/MF_Global" title="MF Global">MF Global</a>, former Governor on New Jersey, former CEO of <a href="/wiki/Goldman_Sachs" title="Goldman Sachs">Goldman Sachs</a></li> 
 </ul> 
 <table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation">

The ideal output would be

Dave Corzine
Jon Corzine

Maybe it would be possible to match the section </b> may refer to:</p> and also <table id="setindexbox", and extract everything in between. I guess <table id="setindexbox" could be matched easily enough in jsoup, but </b> may refer to:</p> should be more difficult because <b> and <p> are not very distinctive.

I tried this:

    Elements table = docx.select("ul");
    Elements links = table.select("li");

    Pattern ppp = Pattern.compile("table id=\"setindexbox\" ");
    Matcher mmm = ppp.matcher(inputLine);

    Pattern pp = Pattern.compile("</b> may refer to:</p>");
    Matcher mm = pp.matcher(inputLine);
    if (mm.matches())
    {
        while (!mmm.matches())
            for (Element link : links)
            {
                String url = link.attr("href");
                String text = link.text();
                System.out.println(text + ", " + url);
            }
    }

but it didn't work.

Answer 1:

This selector works:

Elements els = doc.select("p ~ ul a:eq(0)");

See: http://try.jsoup.org/~yPvgR0pxvA3oWQSJte4Rfm-lS2Y

That looks for an a element that is the first child element of its parent (a:eq(0)), inside a ul that is a later sibling of a p. If that matches too much on a given page, you could narrow it to p:contains(corzine) ~ ul a:eq(0).

Or perhaps more generally: :contains(may refer to) ~ ul a:eq(0)
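To make the selector concrete, here is a minimal sketch that runs it against the HTML fragment from the question. It assumes jsoup is on the classpath; the class name DisambiguationLinks and the method firstLinkTexts are hypothetical names chosen for illustration, and the selector is restricted to p elements as a slightly narrower variant of the general form above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class DisambiguationLinks {

    // Collects the text of the first link in each list item of a <ul>
    // that follows a paragraph containing "may refer to".
    static List<String> firstLinkTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element a : doc.select("p:contains(may refer to) ~ ul a:eq(0)")) {
            names.add(a.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened copy of the fragment from the question.
        String html = "<p><b>Corzine</b> may refer to:</p>"
            + "<ul>"
            + "<li><a href=\"/wiki/Dave_Corzine\" title=\"Dave Corzine\">Dave Corzine</a>"
            + " (born 1956), basketball player</li>"
            + "<li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a>"
            + " (born 1947), former CEO of"
            + " <a href=\"/wiki/MF_Global\" title=\"MF Global\">MF Global</a></li>"
            + "</ul>"
            + "<table id=\"setindexbox\"></table>";

        for (String name : firstLinkTexts(html)) {
            System.out.println(name);
        }
    }
}
```

On this fragment the sketch prints Dave Corzine and Jon Corzine and skips the MF Global link, because that link is not the first child element of its li. To get the URLs as well, a.attr("href") can be read inside the same loop.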

It's hard to generalize across Wikipedia because the markup is loosely structured. But IMHO a parser with CSS selectors is easier to work with than regexes, particularly over time as templates change.
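For comparison, the regex idea from the question (take everything between "may refer to:</p>" and the setindexbox table, then pull the first link out of each list item) could be sketched with java.util.regex alone. This is only an illustration under assumptions: the class name RegexBetweenMarkers and the method names are hypothetical, the whole page is assumed to be in a single string rather than read line by line, and both patterns depend on the exact markup shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class RegexBetweenMarkers {

    // Everything between "may refer to:</p>" and the setindexbox table.
    // DOTALL lets '.' match newlines, since the block spans several lines.
    private static final Pattern BLOCK = Pattern.compile(
        "may refer to:</p>(.*?)<table id=\"setindexbox\"", Pattern.DOTALL);

    // The <a> that directly follows each <li>; group 1 is the link text.
    private static final Pattern FIRST_LINK = Pattern.compile(
        "<li>\\s*<a\\b[^>]*>([^<]*)</a>");

    static List<String> names(String html) {
        List<String> result = new ArrayList<>();
        Matcher block = BLOCK.matcher(html);
        // find() searches inside the input; matches() would require the
        // entire input to match the pattern.
        if (block.find()) {
            Matcher link = FIRST_LINK.matcher(block.group(1));
            while (link.find()) {
                result.add(link.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><b>Corzine</b> may refer to:</p> \n <ul> \n"
            + "  <li><a href=\"/wiki/Dave_Corzine\" title=\"Dave Corzine\">Dave Corzine</a>"
            + " (born 1956), basketball player</li> \n"
            + "  <li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a>"
            + " (born 1947), former CEO of"
            + " <a href=\"/wiki/MF_Global\" title=\"MF Global\">MF Global</a></li> \n"
            + " </ul> \n <table id=\"setindexbox\" class=\"metadata\">";

        for (String name : names(html)) {
            System.out.println(name);
        }
    }
}
```

This also prints Dave Corzine and Jon Corzine for the fragment above, but it stops working as soon as an li gains an attribute or the table id changes, which is the maintenance cost the parser approach avoids.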

Source: https://stackoverflow.com/questions/29811974/extract-loosly-structured-wikipedia-text-html

Tags

html

regex

parsing

jsoup

wikipedia