How to search for comments (“<!-- -->”) using Jsoup?

栀梦 2020-12-31 07:03

I would like to remove those tags with their content from source HTML.

4 Answers
  •  天涯浪人
    2020-12-31 07:28

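    This is a variation of the first example using a functional programming approach. The easiest way to find all comments that are immediate children of the current node is to call .filter() on a stream of .childNodes(). The matches are collected into a list before they are removed, so the child list is not modified while it is being streamed. The method then recurses into each child element to reach nested comments.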

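    public void removeComments(Element e) {
        // Collect first, then remove, so the child list is not changed mid-stream
        e.childNodes().stream()
            .filter(n -> n.nodeName().equals("#comment"))
            .collect(Collectors.toList())
            .forEach(n -> n.remove());
        // Recurse into child elements to reach nested comments
        e.children().forEach(elem -> removeComments(elem));
    }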
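
To check the behaviour without a network call, here is a minimal sketch that runs the same idea on an inline HTML string. It tests `n instanceof Comment` rather than comparing node names, which should be equivalent for this purpose. The class name `CommentStripDemo`, the helper `strip`, and the sample markup are assumptions made for illustration and do not come from the answer.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class CommentStripDemo {

    // Removes every comment node below (and directly inside) the given element.
    public static void removeComments(Element e) {
        // Copy the matches into a list before removing anything
        List<Node> comments = e.childNodes().stream()
            .filter(n -> n instanceof Comment)
            .collect(Collectors.toList());
        comments.forEach(Node::remove);
        // Comments can also sit inside child elements, so recurse
        for (Element child : e.children()) {
            removeComments(child);
        }
    }

    // Hypothetical helper: parse a string, strip comments, return the body markup.
    public static String strip(String html) {
        Document doc = Jsoup.parse(html);
        removeComments(doc);
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(strip("<div><!-- outer --><p>kept<!-- inner --></p></div>"));
    }
}
```

The printed markup should contain the `div` and `p` elements with the text `kept`, and no `<!--` sequences.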
    

    Full example:

    package demo;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.stream.Collectors;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class Demo {
    
        // Requires Java 8 or later (streams and lambdas)
        public static void removeComments(Element e) {
            // Collect first, then remove, so the child list is not changed mid-stream
            e.childNodes().stream()
                .filter(n -> n.nodeName().equals("#comment"))
                .collect(Collectors.toList())
                .forEach(n -> n.remove());
            // Recurse into child elements to reach nested comments
            e.children().forEach(elem -> removeComments(elem));
        }
    
        public static void main(String[] args) throws MalformedURLException, IOException {
            // The second argument is the timeout in milliseconds
            Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);
    
            String userHome = System.getProperty("user.home");
    
            // Save the page as parsed, comments included
            try (PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"))) {
                out.print(doc.outerHtml());
            }
    
            removeComments(doc);
    
            // Save the page again with all comments removed
            try (PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"))) {
                out.print(doc.outerHtml());
            }
        }
    }
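
The question title asks how to search for comments, while the code above goes straight to removing them. As a rough sketch of the search-only case, the following walks the node tree and collects the text of each comment without changing the document. The class name `CommentFinder` and the method `findComments` are hypothetical names chosen for this example, and the sample markup is made up.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Node;

public class CommentFinder {

    // Returns the text of every comment under root, in document order.
    public static List<String> findComments(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    // Depth-first walk: record the node if it is a comment, then visit its children.
    private static void collect(Node node, List<String> found) {
        if (node instanceof Comment) {
            found.add(((Comment) node).getData());
        }
        for (Node child : node.childNodes()) {
            collect(child, found);
        }
    }

    public static void main(String[] args) {
        Node doc = Jsoup.parse("<div><!-- first --><p>text<!-- second --></p></div>");
        findComments(doc).forEach(System.out::println);
    }
}
```

`getData()` returns the comment text without the `<!--` and `-->` delimiters but with any surrounding whitespace, so callers may want to `trim()` the results.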
