Jsoup CSS selector code (XPath code included)

Submitted by 痞子三分冷 on 2019-11-29 07:57:32

It really looks like Jsoup's selectors can't pull the individual pieces of text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}

This outputs:

 some text 1
some text 2
Line 3:
 some text 3

Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.
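
If pulling in XOM and TagSoup is not an option, the same XPath can also be run with the XPath engine that ships with the JDK. This is only a minimal sketch, and it assumes the markup is already well-formed XML (note the self-closing <br/>), since the JDK parser does not repair real-world HTML the way TagSoup does. The class name JdkXPathSketch and the helper followingText are made up for illustration. Because this parser does not place the elements in the XHTML namespace, the xhtml: prefix is dropped from the expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JdkXPathSketch {
    // Assumption: the input is well-formed XML, so <br> is written as <br/>.
    public static final String XHTML =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
        + "<b>some text 2</b><br/><strong>Line 3:</strong> some text 3<br/></div>";

    // Hypothetical helper: returns every text node that follows the first <strong>.
    public static List<String> followingText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@class='info']/strong[1]/following::text()",
                              doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        for (String text : followingText(XHTML)) {
            System.out.println(text);
        }
    }
}
```

For the sample markup this should print the same four text nodes as the XOM version: the text after "Line 1:", the content of <b>, the "Line 3:" label, and the text after it.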

I think your problem is that, of the text you're interested in, only one phrase is enclosed in tags of its own: "some text 2", which sits inside <b> </b> tags. So that one is easy to obtain:

String text2 = doc.select("div.info b").text();

which returns

some text 2

The other pieces of text can only be identified as text held within your <div class="info"> element, and that's it. So the only way that I know of to get them is to fetch all the text held by this larger element:

String text1 = doc.select("div.info").text();

Unfortunately, this returns all the text inside the element, including the text of its child elements:

Line 1: some text 1 some text 2 Line 3: some text 3

That's about the best I can do, and I'm hoping someone can find a better answer and will keep following this question.

It is possible to get an object reference to individual TextNodes. I think maybe you overlooked Jsoup's TextNode class.

The text at the top level of an Element is an instance of TextNode. For instance, " some text 1" and " some text 3" are both TextNode objects under <div class="info">, and "Line 1:" is a TextNode object under <strong>.

Element objects have a textNodes() method that you can use to get hold of these TextNode objects.

Check the following code:

import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

String html = "<html>" +
                  "<body>" +
                      "<div class=\"info\">" +
                          "<strong>Line 1:</strong> some text 1<br>" +
                          "<b>some text 2</b><br>" +
                          "<strong>Line 3:</strong> some text 3<br>" +
                      "</div>" +
                  "</body>" +
              "</html>";

Document document = Jsoup.parse(html);
Element infoDiv = document.select("div.info").first();
List<TextNode> infoDivTextNodes = infoDiv.textNodes();

This code finds the first <div> element whose class attribute is "info", then gets a reference to all of the TextNode objects directly under that <div class="info">. That list looks like:

List<TextNode>[" some text 1", " some text 3"]

TextNode objects have some sweet data and methods that you can use, and TextNode extends Node, which gives you even more functionality.

The following is an example of getting object references for each TextNode directly inside each div with class="info".

for(Iterator<Element> elementIt = document.select("div.info").iterator(); elementIt.hasNext();){
    Element element = elementIt.next();

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        // Do your magic with textNode now.
        // You can even reference its parent via the inherited Node
        // method parent().
    }
}

Using this nested-iterator technique you can reach every text node directly under the selected elements, and with some clever logic you can do just about anything you want within Jsoup's structure.
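
As one example of that kind of logic, each <strong> label can be paired with the text node that immediately follows it. The sketch below is not Jsoup code: so that it runs without any extra library, it uses the JDK's org.w3c.dom API on a well-formed variant of the markup (an assumption, note the <br/>). The class name LabelValueSketch and the method labelsToValues are hypothetical. In Jsoup the same walk would go through Node.nextSibling() and a check for TextNode.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LabelValueSketch {
    // Assumption: the input is well-formed XML, so <br> is written as <br/>.
    public static final String XHTML =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
        + "<b>some text 2</b><br/><strong>Line 3:</strong> some text 3<br/></div>";

    // Hypothetical helper: maps each <strong> label to the text node right after it.
    public static Map<String, String> labelsToValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> pairs = new LinkedHashMap<>();
            NodeList labels = doc.getElementsByTagName("strong");
            for (int i = 0; i < labels.getLength(); i++) {
                Node label = labels.item(i);
                Node next = label.getNextSibling();
                // Keep the pair only when the next sibling really is a text node.
                if (next != null && next.getNodeType() == Node.TEXT_NODE) {
                    pairs.put(label.getTextContent().trim(), next.getNodeValue().trim());
                }
            }
            return pairs;
        } catch (Exception e) {
            // Parsing errors are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        labelsToValues(XHTML).forEach((label, value) ->
                System.out.println(label + " -> " + value));
    }
}
```

The <b> element has no label in front of it, so it is not part of the result; it would still have to be selected separately.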

I implemented this logic for a spell-checking method in the past, and it does take a performance hit on very large HTML documents with a high number of elements (lots of lists, for example). But if your files are of reasonable length, performance should be sufficient.

The following is an example of getting object references for each TextNode of a Document.

Document document = Jsoup.parse(html);

for (Iterator<Element> elementIt = document.body().getAllElements().iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    //Maybe some magic for each element..

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Lots of magic here for each textNode..
    }
}