How to avoid surrounding html head tags in Jsoup parse

别跟我提以往 2020-12-06 16:26

Using Jsoup, I am trying to parse a given piece of HTML. After Jsoup.parse(), the output has html, head, and body tags added around the input. I just want the content without them.

4 Answers
  • 2020-12-06 16:52

    You can try the XML parser, but it doesn't always work because HTML is not always well-formed XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on the <html>, <head>, and <body> tags being there, and they are easy to discard: select the body element and ask for its HTML.

    Document doc = Jsoup.parseBodyFragment(html);
    doc.outputSettings().prettyPrint(false);
    System.out.println(doc.select("body").html());
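
    The point about HTML not always being XML can be checked with a minimal sketch like the one below. It uses the JDK's strict XML parser rather than Jsoup, and the class name WellFormedCheck and the helper isWellFormedXml are made up for this illustration. Jsoup's own XML parser is more lenient and does not throw on such input, but since it does not apply HTML's rules for void elements, an unterminated <br> may end up with the following content nested inside it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Jsoup.
public class WellFormedCheck {

    // Returns true if the fragment parses as well-formed XML
    // with the JDK's strict parser, false if the parser rejects it.
    static boolean isWellFormedXml(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so fatal errors are not printed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(fragment)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed, e.g. an unclosed <br>.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but <br> is never closed, so it is not well-formed XML.
        System.out.println(isWellFormedXml("<p>line one<br>line two</p>"));  // false
        // The self-closing form is acceptable to an XML parser.
        System.out.println(isWellFormedXml("<p>line one<br/>line two</p>")); // true
    }
}
```

    If your input can contain tags like this, that is a reason to prefer the HTML parser and take the body's inner HTML, as in the snippet above.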
    
  • 2020-12-06 16:53

    The cause:

    parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and the HTML parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).

    The solution:

    Don't use an HTML parser; use an XML parser instead ;-)

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    

    Replace that single line and your problem is solved.

    Example:

    final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
    
    Document docHtml = Jsoup.parse(html);
    Document docXml = Jsoup.parse(html, "", Parser.xmlParser());
    
    System.out.println("******* HTML *******\n" + docHtml);
    System.out.println();
    System.out.println("*******  XML *******\n" + docXml);
    

    Output:

    ******* HTML *******
    <html>
     <head></head>
     <body>
      <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
     </body>
    </html>
    
    *******  XML *******
    <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
    
  • 2020-12-06 17:03

    To get the expected output with the HTML parser, it would actually be:

    final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
    Document doc = Jsoup.parseBodyFragment(html);
    doc.outputSettings().prettyPrint(false);
    
    System.out.println(doc.body().html());
    
  • 2020-12-06 17:12

    You can also use Jsoup.parse with the HTML parser. All you need to do is strip away the html and body wrappers.

    This can be done by selecting the body element and unwrapping it:

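    String input = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
    // unwrap() returns a Node, which has outerHtml() but no html()
    Node content = Jsoup.parse(input).body().unwrap();
    System.out.println(content.outerHtml());
    

    body() selects the body element, and unwrap() removes it while keeping its children in the document. Note that unwrap() returns only the first child of the removed element, so this prints the whole input only when the fragment has a single top-level element.

    So the output is: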

    <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
    