I'm using Jsoup for sanitizing user input from a form. The form in question contains a textarea that expects plain text. When the form is submitted, I
Neeme Praks' answer was very good and preserved whitespace correctly. However, inline HTML really messes it up.
This is
some text. Cool story.
Results in
"This is"
Or if you pass in an element that doesn't have its own text, it returns null.
So I had to rework the method a little for my purposes. This might help some folks so I'm posting it here. The basic idea is to iterate the children instead of just taking the first one. This also includes a case to grab the HTML for any elements without children.
This way the original snippet returns:
This is
some text. Cool story.
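import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static String getText(Element cell) {
    // An element without children (e.g. a line break) has no text of its own, so keep its HTML.
    if (cell.childNodes().isEmpty()) {
        return cell.outerHtml();
    }
    StringBuilder textBuilder = new StringBuilder();
    for (Node node : cell.childNodes()) {
        if (node instanceof TextNode) {
            // getWholeText() returns the text with its original whitespace intact.
            textBuilder.append(((TextNode) node).getWholeText());
        } else if (node instanceof Element) {
            // Recurse into child elements so nested inline HTML contributes its text.
            textBuilder.append(getText((Element) node));
        } else {
            // Any other node type (comments, data nodes): keep its markup as-is.
            textBuilder.append(node.outerHtml());
        }
    }
    return textBuilder.toString();
}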
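
If you want to see the first-child problem in isolation, here is a rough sketch that doesn't need Jsoup on the classpath. It uses the JDK's built-in org.w3c.dom parser as a stand-in, so the input has to be well-formed XML (Jsoup is far more forgiving), and the names ChildWalkSketch, parseRoot, firstChildText and allText are made up for illustration. It is not Jsoup's API, just the same "walk every child" idea under those assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the JDK DOM parser stands in for Jsoup here.
class ChildWalkSketch {

    // Parses a well-formed XML snippet and returns its root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    // First-child-only approach: returns null when the first child is not a text node.
    static String firstChildText(Node element) {
        Node first = element.getFirstChild();
        boolean isText = first != null && first.getNodeType() == Node.TEXT_NODE;
        return isText ? first.getNodeValue() : null;
    }

    // Walks every child: text is appended verbatim, elements with children are
    // recursed into, and empty elements are kept as a self-closing tag.
    static String allText(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (child.hasChildNodes()) {
                    sb.append(allText(child));
                } else {
                    sb.append('<').append(child.getNodeName()).append("/>");
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node p = parseRoot("<p>This is<br/>some  text. <b>Cool story.</b></p>");
        System.out.println(firstChildText(p)); // This is
        System.out.println(allText(p));        // This is<br/>some  text. Cool story.
    }
}
```

The first call stops at the line break, which is the truncation described above; the second keeps going through the remaining children and leaves the double space untouched.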