Convert HTML to plain text in Java

Submitted by 妖精的绣舞 on 2019-12-04 02:14:42

Have your parser append text content and newlines to a StringBuilder.

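import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Assumes `html` is a String holding the markup to convert.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    // True once text has been appended since the last newline.
    private boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        // Append each text run with its surrounding whitespace trimmed.
        String s = new String(data);
        sb.append(s.trim());
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Start a new line at div, br and p, at most once per run of text.
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Empty tags such as <br> arrive here; treat them like start tags.
        handleStartTag(t, a, pos);
    }
};
// parse() can throw IOException, so call it from a method that declares or handles it.
new ParserDelegator().parse(new StringReader(html), parserCallback, false);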

I would guess you could use the ParserCallback.

You would need to add code to support the tags that require special handling. There are three callbacks that let you check for the tags you want to monitor and then append a newline character to your buffer:

  1. handleStartTag
  2. handleEndTag
  3. handleSimpleTag
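As a rough sketch of how those callbacks could be wired together, the class below appends a newline when a p or div element closes and when a br tag is seen. The class name CallbackHtmlToText and the choice of tags are assumptions made for illustration; adjust the tag checks to whatever needs special handling in your markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class CallbackHtmlToText {

    static String convert(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Copy text content as the parser delivers it.
                sb.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                // Assumed policy: end a line when a paragraph or div closes.
                if (t == HTML.Tag.P || t == HTML.Tag.DIV) {
                    sb.append('\n');
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // Empty tags such as <br> are reported here.
                if (t == HTML.Tag.BR) {
                    sb.append('\n');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("<p>Hello</p><p>World</p>"));
    }
}
```

Putting the newline in handleEndTag rather than handleStartTag means no flag is needed to suppress a leading blank line, at the cost of a trailing newline after the last block.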

Building on your example, with a hint from the "html to plain text?" message:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TestJsoup
{
  public void simpleParse()
  {
    try
    {
      // Fetch and parse the page
      Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
      // Trick for better formatting
      doc.body().wrap("<pre></pre>");
      String text = doc.text();
      // Converting nbsp entities to plain spaces (literal replace, no regex needed)
      text = text.replace("\u00A0", " ");
      System.out.print(text);
    }
    catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  public static void main(String[] args)
  {
    TestJsoup tjs = new TestJsoup();
    tjs.simpleParse();
  }
}

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.

Hope it is helpful.
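To make the XSLT suggestion concrete, here is a minimal sketch using the JDK's built-in javax.xml.transform API. It assumes the input is well-formed XHTML with no namespace declaration and no DOCTYPE; the stylesheet and the class name XsltHtmlToText are made up for illustration, and the linked answer may well do it differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
class XsltHtmlToText {

    // Illustrative stylesheet: text output, newline after p/div and at br,
    // head/script/style dropped. Everything else falls through to the
    // built-in rules, which copy text nodes to the output.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='p|div'><xsl:apply-templates/><xsl:text>&#10;</xsl:text></xsl:template>"
        + "<xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
        + "<xsl:template match='head|script|style'/>"
        + "</xsl:stylesheet>";

    static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input could not be transformed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(convert("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Real-world HTML is rarely well-formed XML, so this approach only works after the markup has been cleaned up by some other tool.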

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
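A minimal sketch of the SAX half of that idea, using only the JDK's parser. It assumes the input is already well-formed XHTML (for example after a JTidy pass, which is not shown here); the class name SaxHtmlToText and the set of tags treated as line breaks are illustrative choices.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SaxHtmlToText {

    static String convert(String xhtml) {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while inside <script> or <style>.
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("script") || qName.equals("style")) {
                    skipDepth++;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Copy text unless it belongs to a skipped element.
                if (skipDepth == 0) {
                    sb.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("script") || qName.equals("style")) {
                    skipDepth--;
                } else if (qName.equals("p") || qName.equals("div") || qName.equals("br")) {
                    // Assumed policy: these elements end a line.
                    sb.append('\n');
                }
            }
        };
        try {
            // The default factory is not namespace-aware, so qName is the plain tag name.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```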

John Camerin

Jsoup is not compatible with FreeMarker tags (or any other custom, non-HTML tags). Consider this the purest solution for converting HTML to plain text.

http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726 My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();