How to search for comments ("<!-- -->") using Jsoup?


栀梦 2020-12-31 07:03

I would like to remove those tags, along with their content, from the source HTML.

4 answers

天涯浪人 (original poster)

2020-12-31 07:28

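This is a variation of the first example that uses a functional programming approach. The easiest way to find all comments that are immediate children of the current node is to call .filter() on a stream of .childNodes(). The matches are collected into a list before they are removed, and the method then recurses into each child element: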

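public void removeComments(Element e) {
    // Collect the comment nodes first, then remove them from the tree.
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment"))
        .collect(Collectors.toList())
        .forEach(n -> n.remove());
    // Recurse into child elements to reach nested comments.
    e.children().forEach(elem -> removeComments(elem));
}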
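The .collect(Collectors.toList()) step in the snippet above deserves a note: it takes a snapshot of the matches so that the removal runs over a separate list rather than over the list being streamed. Java streams require that the source is not modified while the pipeline runs. The following is a minimal sketch of the same pattern on a plain java.util list, with no jsoup involved; the class name SnapshotBeforeRemove and the method removeMarked are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SnapshotBeforeRemove {

    // Hypothetical helper: removes every entry that starts with "#".
    // The matches are collected into a separate list first, and only
    // then removed from the working list.
    static List<String> removeMarked(List<String> source) {
        List<String> items = new ArrayList<>(source); // mutable working copy
        items.stream()
            .filter(s -> s.startsWith("#"))
            .collect(Collectors.toList())        // snapshot: the stream over items ends here
            .forEach(s -> items.remove(s));      // iterates the snapshot, mutates items
        return items;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("div", "#comment", "p", "#comment");
        System.out.println(removeMarked(nodes)); // prints [div, p]
    }
}
```

Dropping the collect step and calling forEach directly on the filtered stream would modify the list while it is still being traversed, which the stream API does not support.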

Full example:

package demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.URL;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Requires Java 8 or later (streams and lambdas).
public class Demo {

    // Removes the comment nodes directly under e, then recurses into child elements.
    public static void removeComments(Element e) {
        e.childNodes().stream()
            .filter(n -> n.nodeName().equals("#comment"))
            .collect(Collectors.toList())
            .forEach(n -> n.remove());
        e.children().forEach(elem -> removeComments(elem));
    }

    public static void main(String[] args) throws IOException {
        // The second argument is the timeout in milliseconds; raise it on a slow connection.
        Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);
        String userHome = System.getProperty("user.home");

        // Save the page as fetched, comments included.
        try (PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"))) {
            out.print(doc.outerHtml());
        }

        removeComments(doc);

        // Save the page again after the comments have been stripped.
        try (PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"))) {
            out.print(doc.outerHtml());
        }
    }
}
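To try the approach without a network request, a short HTML string can be parsed instead. The sketch below assumes jsoup is on the classpath. The class name InlineCommentCheck and the helper stripComments are made up for illustration. It also matches comments by type (instanceof Comment) rather than by node name, which is a variation and not the code from the answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class InlineCommentCheck {

    // Same recursion as removeComments in the answer, but matching on the
    // node type (org.jsoup.nodes.Comment) instead of the "#comment" name.
    static void removeComments(Element e) {
        List<Node> snapshot = new ArrayList<>(e.childNodes()); // copy before removing
        for (Node n : snapshot) {
            if (n instanceof Comment) {
                n.remove();
            }
        }
        for (Element child : e.children()) {
            removeComments(child);
        }
    }

    // Hypothetical helper: parses an HTML string, strips all comments,
    // and returns the serialized document.
    static String stripComments(String html) {
        Document doc = Jsoup.parse(html);
        removeComments(doc);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String html = "<div><!-- top --><p>Hello<!-- nested --> world</p></div>";
        System.out.println(stripComments(html));
    }
}
```

Because Document extends Element, the same method can be called on the whole document or on any subtree.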
