Convert formatted email (HTML) to plain text?

故事扮演 submitted on 2019-12-23 06:08:19

Question


I have this code that extends ParserCallback to convert HTML emails to plain text. It works fine when I parse an email body like this:

  "DO NOT REPLY TO THIS EMAIL MESSAGE.   <br>---------------------------------------<br>\n" +
                "nix<br>---------------------------------------<br> Esfghjdfkj\n" +
                "</blockquote></div><br><br clear=\"all\"><div><br></div>-- <br><div dir=\"ltr\"><b>Regards <br>Nisj<br>Software Engineer<br></b><div><b>Bingo</b></div></div>\n" +
                "</div>"

But when I parse this kind of email body, it returns null:

 email = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html charset=us-ascii\"></head><body style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;\">Got it...so pls send to customer now.<div><br><div style=\"\"><div>On Nov 8, 2013, at 12:31 PM, <a href=\"mailto:xxxxxxx.com\">xxxxxxx.com</a> wrote:</div><br class=\"Apple-interchange-newline\"><blockquote type=\"cite\">Forwarding test.<br>---------------------------------------<br> ABCD.</blockquote></div><br></div></body></html>";

Code:

import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class EmailBody {
    public static void main(String[] args) throws IOException
    {
        String email = "";

        class EmailCallback extends ParserCallback
        {
            private String body_;
            private boolean divStarted_;

            public String getBody()
            {
                return body_;
            }

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos)
            {
                if (t.equals(Tag.DIV) && "ltr".equals(a.getAttribute(Attribute.DIR)))
                {
                    divStarted_ = true;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos)
            {
                if (t.equals(Tag.DIV))
                {
                    divStarted_ = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos)
            {
                if (divStarted_)
                {
                    body_ = new String(data);
                }
            }
        }
        EmailCallback callback = new EmailCallback();
        Parser parser = new ParserDelegator();
        StringReader reader = new StringReader(email);
        parser.parse(reader, callback, true);
        reader.close();
        System.out.println(callback.getBody());
    }
}

Can you tell me why this is happening?


Answer 1:


Your code only takes text from DIV elements that have a dir attribute with the value ltr. The handleText method stores text only while the divStarted_ flag is true, and that flag is set to true only when handleStartTag sees such a DIV.
The first email example contains such an element; the second one does not, so body_ is never assigned and getBody() returns null.
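
If the goal is the plain text of the whole message rather than only the dir="ltr" block, one option is to drop the flag and append every text chunk the parser reports. Below is a minimal sketch of that idea, not a complete HTML-to-text converter. The names PlainTextExtractor, TextCollector and toPlainText are made up for illustration, the shortened email in main is a stand-in for the second example, and the choice to turn BR and closing DIV/P tags into line breaks is an assumption about the formatting you want. It may also pick up text from elements you do not care about (for example a title), so filter further if needed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not from the original post.
final class PlainTextExtractor {

    // Collects every text chunk instead of only text inside <div dir="ltr">.
    static final class TextCollector extends ParserCallback {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void handleText(char[] data, int pos) {
            // Append rather than overwrite, so earlier chunks are kept.
            text.append(data);
        }

        @Override
        public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
            // Assumption: a <br> should become a line break in the output.
            if (t.equals(Tag.BR)) {
                text.append('\n');
            }
        }

        @Override
        public void handleEndTag(Tag t, int pos) {
            // Assumption: closing a DIV or P ends a line.
            if (t.equals(Tag.DIV) || t.equals(Tag.P)) {
                text.append('\n');
            }
        }

        String getText() {
            return text.toString().trim();
        }
    }

    static String toPlainText(String html) {
        TextCollector collector = new TextCollector();
        try (StringReader reader = new StringReader(html)) {
            // true = ignore the charset declared in the document's meta tag
            new ParserDelegator().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.getText();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the second email body from the question.
        String email = "<html><body>Got it...so pls send to customer now."
                + "<div><blockquote type=\"cite\">Forwarding test.<br>ABCD."
                + "</blockquote></div></body></html>";
        System.out.println(toPlainText(email));
    }
}
```

Because nothing here depends on a particular attribute being present, an email without any dir="ltr" DIV still yields text instead of null. If you only want part of the message (for example, everything before the quoted reply), you could keep a flag as in your original code but key it on a tag that both email formats actually contain.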



Source: https://stackoverflow.com/questions/19902403/convert-formatted-email-html-to-plain-text
