Apache Tika: remove extra line breaks in result string

Posted by 左心房为你撑大大i on 2019-12-10 17:21:46

Question


I have an HTML file:

<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;">
<div>Test message.</div>
<div>&nbsp;</div>
<div>More content here...</div>
<div>&nbsp;</div>
<div>Best regards,</div>
<div>Mr. Crowley</div></div></body></html>

I am trying to extract the content of the file above using Apache Tika...

final InputStream input = new FileInputStream("file.html");
final ContentHandler handler = new BodyContentHandler();
final Metadata metadata = new Metadata();

final HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

...and everything works except for the extra line breaks:

Test message.

 

More content here...

 

Best regards,

Mr. Crowley
<and 3 empty lines here>

Is it possible to avoid this behavior and get a result closer to what I expect:

Test message.
 
More content here...
 
Best regards,
Mr. Crowley

?

Unfortunately, post-processing the string with something like

plainText = plainText.replaceAll("(\n)+", "\n");

is not an option for me. I also cannot change the structure of my HTML file.


Answer 1:


One solution is to implement a custom ContentHandler that does not write those newlines (line breaks that come from the original document are still kept):

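import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class OriginalBodyContentHandler extends BodyContentHandler {
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // Intentionally empty: skip the extra newlines that
        // XHTMLContentHandler emits as ignorable whitespace.
    }
}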
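
To see why overriding ignorableWhitespace is enough, here is a minimal sketch that uses only the JDK's org.xml.sax classes, so it runs without Tika. The names TextCollector and render are hypothetical, and the event sequence is an assumption made up for illustration (document text arrives through characters, parser-added newlines through ignorableWhitespace); it is not a trace of what Tika actually emits.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /** Hypothetical collector: gathers text and can drop ignorable whitespace. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text that belongs to the document is always kept.
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Formatting whitespace is kept only when requested.
            if (keepIgnorable) {
                out.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Feeds an assumed event stream into the collector and returns the text. */
    public static String render(boolean keepIgnorable) {
        TextCollector handler = new TextCollector(keepIgnorable);
        String[] lines = {"Test message.", "Best regards,", "Mr. Crowley"};
        char[] newline = {'\n'};
        for (String line : lines) {
            char[] text = (line + "\n").toCharArray();
            handler.characters(text, 0, text.length);    // document text
            handler.ignorableWhitespace(newline, 0, 1);  // extra newline (assumed)
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(true));   // blank line after every line
        System.out.println("----");
        System.out.print(render(false));  // one line break per line
    }
}
```

With Tika on the classpath, the handler above would be used by passing new OriginalBodyContentHandler() where the question's code constructs new BodyContentHandler().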


Source: https://stackoverflow.com/questions/17475613/apache-tika-remove-extra-line-breaks-in-result-string
