Convert HTML to plain text in Java

终归单人心 2021-02-20 08:51

I need to convert HTML to plain text. My only requirement of formatting is to retain new lines in the plain text. New lines should be displayed not only in the case of <

6 answers
  •  温柔的废话
    2021-02-20 09:23

    Have your parser append text content and newlines to a StringBuilder.

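    import java.io.StringReader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // `html` is the input markup as a String.
    final StringBuilder sb = new StringBuilder();
    HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
        // True once text has been appended since the last newline, so that
        // consecutive block tags do not produce empty lines.
        private boolean readyForNewline;

        @Override
        public void handleText(final char[] data, final int pos) {
            // Each text chunk is trimmed, so words separated only by an
            // inline tag (e.g. <b>) end up joined without a space.
            sb.append(new String(data).trim());
            readyForNewline = true;
        }

        @Override
        public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
            // Start a new line at <div>, <br> and <p>, but only after some text.
            if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
                sb.append("\n");
                readyForNewline = false;
            }
        }

        @Override
        public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
            // Empty tags such as <br> are reported here; treat them like start tags.
            handleStartTag(t, a, pos);
        }
    };
    // parse() declares IOException; the last argument is ignoreCharSet.
    new ParserDelegator().parse(new StringReader(html), parserCallback, false);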
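
    For anyone who wants to try this as-is, here is a minimal self-contained sketch of the same callback wrapped in a class. The class name HtmlToText and the method convert are hypothetical names chosen for illustration and are not part of Swing; the IOException wrapping is also my assumption about how you would want to surface errors.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical wrapper class; the name and the convert method are illustrative only.
final class HtmlToText {

    // Returns the text content of the HTML, starting a new line at <div>, <br> and <p>.
    static String convert(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Set after text is appended, cleared after a newline is appended.
            private boolean readyForNewline;

            @Override
            public void handleText(char[] data, int pos) {
                sb.append(new String(data).trim());
                readyForNewline = true;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
                    sb.append("\n");
                    readyForNewline = false;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // <br> is reported as a simple tag, so reuse the start-tag logic.
                handleStartTag(t, a, pos);
            }
        };
        try {
            // Third argument is ignoreCharSet.
            new ParserDelegator().parse(new StringReader(html), callback, false);
        } catch (IOException e) {
            // A StringReader is not expected to fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("<p>first</p><p>second</p>"));
    }
}
```

    The main method just prints the converted text for a two-paragraph input. If you need other block-level tags (list items, headings, table rows) to break lines as well, adding them to the tag check in handleStartTag should be enough.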
    
