How to avoid surrounding html head tags in Jsoup parse

别跟我提以往 2020-12-06 16:26

I am using Jsoup to parse some given HTML content. After Jsoup.parse(), the output has html, head and body tags added around the input. I just want to avoid these.

4 answers
  •  谎友^ (original poster)
    2020-12-06 16:53

    The cause:

    parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and that parser always adds the HTML shell (<html>, <head>, <body>, etc.).

    The Solution:

    Just don't use an HTML parser; use an XML parser instead ;-)

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    

    Replace that single line and your problem is solved.

    Example:

    final String html = "<p>This is my sentence of text.</p>";

    Document docHtml = Jsoup.parse(html);
    Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

    System.out.println("******* HTML *******\n" + docHtml);
    System.out.println();
    System.out.println("******* XML *******\n" + docXml);

    Output:

    ******* HTML *******
    <html>
     <head></head>
     <body>
      <p>This is my sentence of text.</p>
     </body>
    </html>

    ******* XML *******
    <p>This is my sentence of text.</p>
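
For anyone who wants to paste this into a project, here is a minimal, self-contained sketch of the same idea. It assumes jsoup is on the classpath. The class name, the method names and the sample input are made up for illustration. The second method is an alternative that is not part of the answer above: it keeps the HTML parser and reads back only the inner HTML of the body.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class FragmentParseSketch {

    // Parse with the XML parser so that no <html>/<head>/<body> shell is added.
    // The empty string is the base URI, as in the one-liner from the answer.
    static String parseWithoutShell(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        return doc.outerHtml();
    }

    // Alternative (an assumption, not the answer's method): keep the HTML parser,
    // let it build the shell, then return only what ended up inside <body>.
    static String bodyOnly(String html) {
        return Jsoup.parseBodyFragment(html).body().html();
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<p>This is my sentence of text.</p>";
        System.out.println(parseWithoutShell(html));
        System.out.println(bodyOnly(html));
    }
}
```

The two approaches are not equivalent in general. The XML parser keeps the markup as written, while the body-fragment route still applies HTML parsing rules to the input before you read it back, so which one fits depends on whether you want that cleanup.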
