Using Jsoup, I am trying to parse some given HTML content. After Jsoup.parse(), the HTML output has html, head and body tags added around the input. I just want to ignore these.
You can try using the XML parser, but this doesn't always work because HTML is not always valid XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on the <html>, <head>, and <body> tags being there, and they are easy to discard. Just get your fragment of HTML by selecting the body tag and asking for its HTML.
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.select("body").html());
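To see why unterminated tags matter, here is a minimal sketch that runs the same fragment through both parsers. The class name ParserComparison and its two helper methods are made up for this illustration; only the Jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only.
public class ParserComparison {

    // Parse with the HTML parser and return only the contents of <body>.
    static String viaHtmlParser(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    // Parse with the XML parser, which adds no html/head/body shell.
    static String viaXmlParser(String fragment) {
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        // <br> is never closed here, which is fine in HTML but not in XML.
        String fragment = "<p>Line one<br>Line two</p>";
        System.out.println("HTML parser: " + viaHtmlParser(fragment));
        System.out.println("XML parser:  " + viaXmlParser(fragment));
    }
}
```

The HTML parser knows that <br> is a void element and gives the fragment back unchanged. The XML parser has no such knowledge, so it treats <br> as an open element and the text after it typically ends up nested inside it.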
parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and that parser always adds the HTML shell (<html>…</html>, <head>…</head> etc.).
Just don't use an HTML parser; use an XML parser instead ;-)
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Replace that single line and your problem is solved.
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());
System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("******* XML *******\n" + docXml);
Output:
******* HTML *******
<html>
<head></head>
<body>
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
</body>
</html>
******* XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
To get the expected output with the HTML parser, it would actually be:
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.body().html());
You can also use Jsoup.parse with the HTML parser. All you need to do is strip away the html and body wrappers.
This can be done by selecting the body element and unwrapping it:
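String input = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
// unwrap() returns the first child of body as a Node (here the <p> element)
Node content = Jsoup.parse(input).body().unwrap();
// Node has no no-arg html() method, so print it with outerHtml()
System.out.println(content.outerHtml());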
body() selects the body element, and unwrap() removes that element while keeping its children, so only the content remains.
So the output is:
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
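One thing to keep in mind with unwrap(): it returns only the first child that body held. For a fragment with several top-level elements, asking body for its HTML keeps all of them. A small sketch of the difference, where the class and method names are hypothetical and chosen only for this example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Hypothetical helper class, for illustration only.
public class UnwrapVersusBodyHtml {

    // Everything inside <body>, regardless of how many top-level nodes there are.
    static String allContent(String fragment) {
        Document doc = Jsoup.parse(fragment);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    // Only the first node that <body> held, as returned by unwrap().
    static String firstChildOnly(String fragment) {
        Document doc = Jsoup.parse(fragment);
        doc.outputSettings().prettyPrint(false);
        Node first = doc.body().unwrap();
        // unwrap() returns null when body had no children.
        return first == null ? "" : first.outerHtml();
    }

    public static void main(String[] args) {
        String fragment = "<p>One</p><p>Two</p>";
        System.out.println(allContent(fragment));
        System.out.println(firstChildOnly(fragment));
    }
}
```

For a single-element fragment like the one in the question, both approaches give the same result.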