Using Jsoup i try to parse the given html content. After Jsoup.parse() the html output append html, head and body tag to the input. I just want to ignore these.
parseBodyFragment() as well as all other parse()-methods use a HTML parser by default. And those add always the HTML-Shell (…, … etc.).
Just don't use a HTML-parser, use a XML-parser instead ;-)
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Replace that single line and your problem is solved.
final String html = "This is my sentence of text.
";
Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());
System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("******* XML *******\n" + docXml);
Output:
******* HTML *******
This is my sentence of text.
******* XML *******
This is my sentence of text.