Jsoup - extracting text

日久生厌 2020-12-18 01:10

I need to extract text from a node like this:

    <div>
        Some text <b>with tags</b> might go here.
        <p>Also there are paragraphs</p>
        More text can go without paragraphs<br/>
    </div>

4 answers
  • 2020-12-18 01:19

    Assuming you want the text only (no tags), my solution is below.
    The output is:
    Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs

    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    public class StripTagsExample {

        public static void main(String[] args) {
            String str =
                      "<div>"
                    + "    Some text <b>with tags</b> might go here."
                    + "    <p>Also there are paragraphs.</p>"
                    + "    More text can go without paragraphs<br/>"
                    + "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();
            StringBuilder builder = new StringBuilder();
            stripTags(builder, div.childNodes());
            System.out.println("Text without tags: " + builder.toString().trim());
        }

        /**
         * Appends the text of every text node in the list, recursing into child nodes.
         * @param builder StringBuilder that accumulates the text (input and output)
         * @param nodesList List of type <code>Node</code> to walk
         */
        public static void stripTags(StringBuilder builder, List<Node> nodesList) {
            for (Node node : nodesList) {
                if (node instanceof TextNode) {
                    // text() gives the decoded text; toString() would return escaped HTML
                    builder.append(((TextNode) node).text());
                } else {
                    // not a text node: recurse into its children
                    stripTags(builder, node.childNodes());
                }
            }
        }
    }
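
    The same walk can also be written with jsoup's own traversal hook instead of hand-written recursion. The following is only a sketch under that assumption, not part of the original answer: the class name VisitorText and the helper collectText are made up for illustration. Both head and tail are implemented so that it also compiles on jsoup releases where tail is not a default method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class VisitorText {

    // Hypothetical helper: concatenates the text of every TextNode under root,
    // in document order, without any tags.
    static String collectText(Element root) {
        final StringBuilder builder = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the decoded text with whitespace runs collapsed
                    builder.append(((TextNode) node).text());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        });
        return builder.toString().trim();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs.</p>"
                + " More text can go without paragraphs<br/></div>");
        Element div = doc.select("div").first();
        System.out.println(collectText(div));
        // Element.text() is the built-in accessor for tag-free text, shown for comparison
        System.out.println(div.text());
    }
}
```

    If plain text is all that is needed, Element.text() alone is usually enough; a visitor is mainly useful when individual nodes need different handling.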
    
  • 2020-12-18 01:28

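    You can use TextNode for this purpose. Note that textNodes() returns only the element's direct child text nodes, so text nested inside child tags is not included:

    // assumes the target element has id="content"
    List<TextNode> bodyTextNodes = doc.getElementById("content").textNodes();
    StringBuilder text = new StringBuilder();
    for (TextNode txNode : bodyTextNodes) {
        text.append(txNode.text());
    }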
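
    To make the scope of textNodes() concrete, here is a minimal sketch that compares it with ownText() and text() on the markup from the question. The id "content" and the names DirectTextOnly and directText are assumptions made for the example, not something from the original post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class DirectTextOnly {

    static final String HTML = "<div id=\"content\"> Some text <b>with tags</b> might go here."
            + " <p>Also there are paragraphs</p>"
            + " More text can go without paragraphs<br/></div>";

    // Joins only the direct child text nodes of the element
    static String directText(Element element) {
        StringBuilder text = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            text.append(node.text());
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        Element content = Jsoup.parse(HTML).getElementById("content");
        // Direct text nodes only: the words inside <b> and <p> are missing
        System.out.println("direct:  " + directText(content));
        // ownText() likewise covers only the element's own text
        System.out.println("ownText: " + content.ownText());
        // text() also includes the text of descendant elements
        System.out.println("text:    " + content.text());
    }
}
```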
    
  • 2020-12-18 01:34

    Element.children() returns an Elements object, which is a list of Element objects only, so text nodes are left out. The parent class, Node, has methods that give you access to arbitrary nodes, not just Elements, such as Node.childNodes().

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;

    public class ChildNodesExample {

        public static void main(String[] args) {
            String str = "<div>" +
                    "    Some text <b>with tags</b> might go here." +
                    "    <p>Also there are paragraphs</p>" +
                    "    More text can go without paragraphs<br/>" +
                    "</div>";

            Document doc = Jsoup.parse(str);
            Element div = doc.select("div").first();
            int i = 0;

            // childNodes() yields text nodes as well as elements, in document order
            for (Node node : div.childNodes()) {
                i++;
                System.out.println(String.format("%d %s %s",
                        i,
                        node.getClass().getSimpleName(),
                        node.toString()));
            }
        }
    }
    

    Result:

    1 TextNode 
     Some text 
    2 Element <b>with tags</b>
    3 TextNode  might go here. 
    4 Element <p>Also there are paragraphs</p>
    5 TextNode  More text can go without paragraphs
    6 Element <br/>
    
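  • Iterate over every element under body and read each element's own text nodes:

    for (Element el : doc.select("body").select("*")) {
        for (TextNode node : el.textNodes()) {
            // Assumption: the call that wrapped node.text() was lost from the post;
            // printing stands in for it here.
            System.out.println(node.text());
        }
    }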
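
    One thing to watch with this loop: elements are visited in document order, but each element's own text nodes are emitted as a group, so the pieces do not come out in reading order when text and tags are interleaved. A small sketch of that effect, assuming the sample markup from the question; the names PerElementText and textPieces are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class PerElementText {

    // Collects the trimmed, non-blank text of each element's direct text nodes
    static List<String> textPieces(Document doc) {
        List<String> pieces = new ArrayList<>();
        for (Element el : doc.select("body").select("*")) {
            for (TextNode node : el.textNodes()) {
                String piece = node.text().trim();
                if (!piece.isEmpty()) {
                    pieces.add(piece);
                }
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/></div>");
        // The div's three text nodes are listed before "with tags",
        // because the <b> element is visited after its parent.
        for (String piece : textPieces(doc)) {
            System.out.println(piece);
        }
    }
}
```

    When reading order matters, walking Node.childNodes() recursively keeps the pieces in sequence.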
    