How to search for comments ("<!-- -->") using Jsoup?

栀梦 2020-12-31 07:03

I would like to remove those comment tags, together with their content, from the source HTML.

4 answers
  • 2020-12-31 07:25

    With jsoup 1.11+ (and possibly older versions), you can apply a NodeFilter:

    private void removeComments(Element article) {
        article.filter(new NodeFilter() {
            @Override
            public FilterResult tail(Node node, int depth) {
                if (node instanceof Comment) {
                    return FilterResult.REMOVE;
                }
                return FilterResult.CONTINUE;
            }
    
            @Override
            public FilterResult head(Node node, int depth) {
                if (node instanceof Comment) {
                    return FilterResult.REMOVE;
                }
                return FilterResult.CONTINUE;
            }
        });
    }
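
    A quick way to check the filter is to run it against a small in-memory document. The following is only a minimal sketch, assuming a jsoup version that provides Element.filter(NodeFilter) is on the classpath; the class name FilterDemo, the helper stripComments and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

public class FilterDemo {

    // Same filter as above: every Comment node is marked for removal.
    static void removeComments(Element root) {
        root.filter(new NodeFilter() {
            @Override
            public FilterResult head(Node node, int depth) {
                return node instanceof Comment ? FilterResult.REMOVE : FilterResult.CONTINUE;
            }

            @Override
            public FilterResult tail(Node node, int depth) {
                return node instanceof Comment ? FilterResult.REMOVE : FilterResult.CONTINUE;
            }
        });
    }

    // Hypothetical helper: parses the HTML, strips comments, returns the body's inner HTML.
    static String stripComments(String html) {
        Document doc = Jsoup.parse(html);
        removeComments(doc);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><!-- foo --><p>bar<!-- baz --></p></div><!-- qux -->";
        System.out.println(stripComments(html));
    }
}
```

    The printed markup should no longer contain any "<!--" sequence, while the text "bar" is left in place.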
    
  • 2020-12-31 07:28

    This is a variation of the first example that uses a functional programming approach. The easiest way to find all comments that are immediate children of the current node is to call .filter() on a stream of .childNodes(); recursing into the child elements then covers the rest of the tree.

    public void removeComments(Element e) {
        e.childNodes().stream()
            .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
            .forEach(n -> n.remove());
        e.children().forEach(elem -> removeComments(elem));
    }
    

    Full example:

    package demo;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.stream.Collectors;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    public class Demo {

        // Removes the comment nodes that are direct children of e, then recurses into its child elements.
        public static void removeComments(Element e) {
            e.childNodes().stream()
                .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
                .forEach(n -> n.remove());
            e.children().forEach(elem -> removeComments(elem));
        }

        // Requires JDK 8 or later (streams and lambdas).
        public static void main(String[] args) throws MalformedURLException, IOException {
            // The second argument is the timeout in milliseconds.
            Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);

            // Save the page before and after comment removal so the two files can be compared.
            String userHome = System.getProperty("user.home");
            PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
            out.print(doc.outerHtml());
            out.close();

            removeComments(doc);
            out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
            out.print(doc.outerHtml());
            out.close();
        }
    }

  • 2020-12-31 07:41

    When searching, you normally use Elements.select(selector), where the selector follows jsoup's selector syntax. Comments, however, are technically not elements, which can be confusing; they are nodes, identified by the node name #comment.

    Let's see how that might work:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;
    
    public class RemoveComments {
        public static void main(String... args) {
            String h = "<html><head></head><body>" +
              "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
            Document doc = Jsoup.parse(h);
            removeComments(doc);
            doc.html(System.out);
        }
    
        private static void removeComments(Node node) {
            for (int i = 0; i < node.childNodeSize();) {
                Node child = node.childNode(i);
                if (child.nodeName().equals("#comment"))
                    child.remove();
                else {
                    removeComments(child);
                    i++;
                }
            }
        }        
    }
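
    If the goal is to search for the comments rather than delete them, the same recursive walk can collect the nodes instead. This is a minimal sketch along those lines, not part of the original answer; the class FindComments and the helpers collectComments and commentTexts are hypothetical names, and it checks for the Comment class rather than the "#comment" node name.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class FindComments {

    // Recursively collects every comment node below the given node, in document order.
    static void collectComments(Node node, List<Comment> found) {
        for (Node child : node.childNodes()) {
            if (child instanceof Comment) {
                found.add((Comment) child);
            } else {
                collectComments(child, found);
            }
        }
    }

    // Hypothetical helper: returns the trimmed text of every comment in the HTML.
    static List<String> commentTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<Comment> found = new ArrayList<>();
        collectComments(doc, found);
        List<String> texts = new ArrayList<>();
        for (Comment c : found) {
            texts.add(c.getData().trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String h = "<html><head></head><body>"
            + "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        System.out.println(commentTexts(h)); // prints [foo, baz, qux]
    }
}
```

    Because nothing is removed during the walk, a plain for-each over childNodes() is safe here; the index-based loop above is only needed when nodes are deleted while iterating.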
    
  • 2020-12-31 07:42

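    Building on @dlamblin's answer (https://stackoverflow.com/a/7541875/4712855), this code does the opposite: it replaces each comment with the HTML it contains, so the commented-out markup becomes part of the document: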

    public static void getHtmlComments(Node node) {
        for (int i = 0; i < node.childNodeSize();i++) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment")) {
                Comment comment = (Comment) child;
                child.after(comment.getData());
                child.remove();
            }
            else {
                getHtmlComments(child);
            }
        }
    }
    