Jsoup CSS selector code (XPath code included)

Question

I am trying to parse the HTML below using jsoup, but I can't work out the right syntax for it.

<div class="info"><strong>Line 1:</strong> some text 1<br>
  <b>some text 2</b><br>
  <strong>Line 3:</strong> some text 3<br>
</div>

I need to capture some text 1, some text 2 and some text 3 in three different variables.

I have the XPath for the first line (line 3 should be similar), but I am unable to work out the equivalent CSS selector.

//div[@class='info']/strong[1]/following::text()

Please help.

On a separate note, I have a few hundred HTML files and need to parse them and extract data to store in a database. Is Jsoup the best choice for this?

I am trying to re-open this question as I still haven't found the solution. Please help.

Answer 1:

It really looks like Jsoup can't handle getting text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}

This outputs:

 some text 1
some text 2
Line 3:
 some text 3

Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.
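
The output above has four lines because following::text() selects every text node after the first <strong>, including the ones inside later elements. If only the text directly after each label is wanted, following-sibling::text()[1] narrows the match. Below is a minimal sketch of that idea using only the JDK's built-in XPath support. It assumes the markup is well-formed XML (note the self-closing <br/>), since the JDK parser rejects the unclosed <br> of the original HTML; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SiblingTextXPath {
    // Assumption: a well-formed variant of the question's markup (<br/> instead of <br>).
    static final String SAMPLE =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
        + "<b>some text 2</b><br/>"
        + "<strong>Line 3:</strong> some text 3<br/></div>";

    /** Returns {text1, text2, text3} for the div with class "info". */
    static String[] extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String div = "//div[@class='info']";
            return new String[] {
                // First text node that is a sibling after the first <strong>.
                xpath.evaluate("normalize-space(" + div + "/strong[1]/following-sibling::text()[1])", doc),
                // String value of the <b> element.
                xpath.evaluate("normalize-space(" + div + "/b)", doc),
                // First text node that is a sibling after the second <strong>.
                xpath.evaluate("normalize-space(" + div + "/strong[2]/following-sibling::text()[1])", doc)
            };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        for (String value : extract(SAMPLE)) {
            System.out.println(value);
        }
    }
}
```

normalize-space() trims the leading space that the raw text nodes carry. For real-world HTML that is not well-formed, a tolerant parser such as TagSoup (as used above) would still be needed in front of the XPath step.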

Answer 2:

I think your problem is that, of the text you're interested in, only one phrase is enclosed within tags of its own: "some text 2", which is enclosed by <b> </b> tags. So this one is easily obtainable via:

String text2 = doc.select("div.info b").text();

which returns

some text 2

The other texts of interest are identified only as text held directly within your <div class="info"> tag. The only way I know of to reach them is to get all the text held by this larger element:

String text1 = doc.select("div.info").text();

But unfortunately, this gets all the text held by this element:

Line 1: some text 1 some text 2 Line 3: some text 3

That's about the best I can do, and I'm hoping someone can find a better answer and will keep following this question.

Answer 3:

It is possible to get an object reference to individual text nodes. I think you may have overlooked Jsoup's TextNode class.

The text at the top level of an Element is an instance of TextNode. For instance, " some text 1" and " some text 3" are both TextNode objects under <div class='info'>, and "Line 1:" is a TextNode object under <strong>.

Element objects have a textNodes() method that you can use to get hold of these TextNode objects.

Check the following code:

String html = "<html>" +
                  "<body>" +
                      "<div class=\"info\">" +
                          "<strong>Line 1:</strong> some text 1<br>" +
                          "<b>some text 2</b><br>" +
                          "<strong>Line 3:</strong> some text 3<br>" +
                      "</div>" +
                  "</body>" +
              "</html>";

Document document = Jsoup.parse(html);
Element infoDiv = document.select("div.info").first();
List<TextNode> infoDivTextNodes = infoDiv.textNodes();

This code finds the first <div> element whose class attribute has the value "info", then gets a reference to all of the TextNode objects directly under <div class='info'>. That list looks like:

List<TextNode>[" some text 1", " some text 3"]

TextNode objects carry useful data and methods of their own, and because TextNode extends Node you get even more functionality to work with.

The following is an example of getting an object reference for each TextNode inside every div with class="info".
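
To tie this back to the question's three variables, here is a minimal sketch that combines textNodes() with a selector for the <b> element. It assumes the div always follows the layout shown in the question (two direct text nodes and one <b>); the class name, method name, and SAMPLE constant are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class InfoDivExtractor {
    // The markup from the question, including its line breaks and indentation.
    static final String SAMPLE =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br>\n"
        + "  <b>some text 2</b><br>\n"
        + "  <strong>Line 3:</strong> some text 3<br>\n"
        + "</div>";

    /** Returns {text1, text2, text3}; assumes the layout from the question. */
    static String[] extract(String html) {
        Document document = Jsoup.parse(html);
        Element infoDiv = document.select("div.info").first();
        if (infoDiv == null) {
            throw new IllegalArgumentException("no div.info element found");
        }
        // Direct text children of the div; whitespace-only nodes
        // (the indentation between lines) are skipped.
        List<String> ownTexts = new ArrayList<>();
        for (TextNode node : infoDiv.textNodes()) {
            if (!node.isBlank()) {
                ownTexts.add(node.text().trim());
            }
        }
        if (ownTexts.size() < 2) {
            throw new IllegalArgumentException("expected two direct text nodes");
        }
        String text1 = ownTexts.get(0);
        String text2 = infoDiv.select("b").text();
        String text3 = ownTexts.get(1);
        return new String[] {text1, text2, text3};
    }

    public static void main(String[] args) {
        for (String value : extract(SAMPLE)) {
            System.out.println(value);
        }
    }
}
```

The isBlank() check matters once the HTML contains line breaks between the elements, because the parser keeps that whitespace as extra text nodes.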

for(Iterator<Element> elementIt = document.select("div.info").iterator(); elementIt.hasNext();){
    Element element = elementIt.next();

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Do your magic with textNode now.
        //You can even reference its parent via the inherited Node
        //method .parent();
    }
}

Using this nested iterator technique you can access all the text nodes of an element, and with some clever logic you can do just about anything you want within Jsoup's structure.

I implemented this logic for a spell-checking method in the past, and it does take a performance hit on very large HTML documents with a high number of elements (many lists, for example). If your files are of reasonable length, performance should be sufficient.
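
As one example of such logic, the sketch below pairs each <strong> label with the text node that immediately follows it, using Node.nextSibling(). It assumes every label is a direct child of div.info and is followed directly by its text; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class LabelledTextExtractor {
    static final String SAMPLE =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br>"
        + "<b>some text 2</b><br>"
        + "<strong>Line 3:</strong> some text 3<br></div>";

    /** Maps each <strong> label to the text node right after it, in document order. */
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Element label : Jsoup.parse(html).select("div.info > strong")) {
            Node next = label.nextSibling();
            // Only keep labels that are directly followed by text.
            if (next instanceof TextNode) {
                result.put(label.text(), ((TextNode) next).text().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach((label, value) -> System.out.println(label + " -> " + value));
    }
}
```

Keying on the label rather than on position keeps the extraction stable if some files omit a line, which may help when processing a few hundred files of slightly varying layout.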

The following is an example of getting object references for each TextNode of a Document.

Document document = Jsoup.parse(html);

for (Iterator<Element> elementIt = document.body().getAllElements().iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    //Maybe some magic for each element..

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Lots of magic here for each textNode..
    }
}

Source: https://stackoverflow.com/questions/11816878/jsoup-css-selector-code-xpath-code-included

Tags

xpath

css-selectors

html-parsing

jsoup

tag-soup