SAX Parser : Retrieving HTML tags from XML

Submitted by 情到浓时终转凉″ on 2020-01-02 13:51:53

Question


I have an XML document to parse, shown below:

<feed>
    <feed_id>12941450184d2315fa63d6358242</feed_id>
    <content> <fieldset><table cellpadding='0'  border='0'  cellspacing='0'  style="clear :both"><tr valign='top' ><td width='35' ><a href='http://mypage.rediff.com/android/32868898'  class='space' onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" ><div style='width:25px;height:25px;overflow:hidden;'><img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb'  width='25'  vspace='0'  /></div></a></td> <td><span><a href='http://mypage.rediff.com/android/32868898'  class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" >Android </a> </span><span style='color:#000000 !important;'>testing</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/></content>
    <action>status updated</action>
</feed>

The <content> tag contains HTML, which holds the data I need. I am using a SAX parser. Here is what I am doing:

private Timeline timeLine; //Object
private String tempStr;

public void characters(char[] ch, int start, int length)
        throws SAXException {
    tempStr = new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equalsIgnoreCase("content")) {
        if (timeLine != null) {
            timeLine.setContent(tempStr);
        }
    }
}

Will this logic work? If not, how should I extract the embedded HTML from the XML using a SAX parser?


Answer 1:


You can parse the HTML with SAX as long as it is well-formed XML (that is, XHTML). There is a similar question on Stack Overflow that you can try: How to parse the html content in android using SAX PARSER




Answer 2:


On start element: if the element is content, initialize your tempStr buffer. Otherwise, if content has already started, capture the current start element and its attributes and append them to the tempStr buffer.

On characters: if content has started, append the characters to the current string buffer.

On end element: if content has started, capture the end tag and append it to the string buffer.

My assumption:

The XML will have only one content tag.
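
The steps above can be sketched as a handler along the following lines. This is a minimal sketch, not the answerer's own code: the class name `ContentCapture` and the helpers `extractContent` and `escape` are made up for illustration, it matches on `qName` (which is what a default, non-namespace-aware `SAXParserFactory` reports), and it assumes a single, non-nested <content> element. Because the markup is rebuilt from events, the result is equivalent to the input rather than byte-identical: attributes come back double-quoted and an empty tag such as <br/> comes back as <br></br>.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ContentCapture {

    /** Rebuilds the markup found inside <content> from SAX events. */
    static class Handler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inContent = false;

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if (!inContent) {
                // Start capturing once <content> opens; ignore everything else.
                if ("content".equals(qName)) {
                    inContent = true;
                    buffer.setLength(0);
                }
                return;
            }
            // Inside <content>: write the start tag and its attributes back out.
            buffer.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                buffer.append(' ').append(atts.getQName(i))
                      .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            buffer.append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so always append.
            if (inContent) {
                buffer.append(escape(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!inContent) {
                return;
            }
            if ("content".equals(qName)) {
                inContent = false;          // capture is complete
            } else {
                buffer.append("</").append(qName).append('>');
            }
        }

        String getContent() {
            return buffer.toString();
        }

        // The parser hands over decoded text, so re-escape the special characters.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }
    }

    /** Hypothetical helper: parses the XML string and returns the captured markup. */
    static String extractContent(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<feed><content><p>Hello <b>world</b></p></content></feed>";
        System.out.println(extractContent(xml));
    }
}
```

The input still has to be well-formed XML for this to work. Raw `&` characters inside attribute values, like those in the tracking URLs of the sample feed, would have to be escaped as `&amp;` before a SAX parser accepts the document.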




Answer 3:


If the html is actually xhtml, you can parse it using SAX and extract the xhtml contents of the <content> tag, but not nearly this easily.

You would have to make your handler actually respond to the events that will be raised by all the xhtml tags inside the <content> tag, and either build something resembling a DOM structure, which you could then serialize back out to xml form, or on-the-fly directly write into an xml string buffer replicating the contents.

If you modify your XML so that the HTML inside the content tag is wrapped in a CDATA section, as suggested in How to parse the html content in android using SAX PARSER, something not too far from your code should indeed work.

But you can't just put the contents into your String tempStr variable in the characters method as you're doing. You'll need to have a startElement method that initializes a buffer for the string on seeing the <content> tag, collect into that buffer in the characters method, and put the result somewhere in the endElement for the <content> tag.
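
A minimal sketch of that buffer-based handler for the CDATA case might look like the following. It is an illustration under stated assumptions rather than a drop-in fix: the class name `CdataContentDemo` and the helper `readContent` are hypothetical, the result is returned as a `String` instead of being passed to `timeLine.setContent(...)`, and elements are matched on `qName` as reported by a default `SAXParserFactory`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CdataContentDemo {

    /** Collects the text of <content>, including text delivered from CDATA. */
    static class Handler extends DefaultHandler {
        private StringBuilder buffer;   // non-null only while inside <content>
        private String content;         // result, set when </content> is seen

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if ("content".equals(qName)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append rather than overwrite: one text node may arrive in chunks.
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("content".equals(qName) && buffer != null) {
                content = buffer.toString();
                buffer = null;
            }
        }

        String getContent() {
            return content;
        }
    }

    /** Hypothetical helper: returns the text of <content>, or null if absent. */
    static String readContent(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<feed><content><![CDATA[<span>testing</span>]]></content></feed>";
        System.out.println(readContent(xml));
    }
}
```

With the HTML inside a CDATA section the parser reports it as plain character data, so the tags arrive through `characters` untouched and no start or end element events need to be replayed.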




Answer 4:


I solved it this way:

Note: in this solution I want to get the HTML content between the <chapter> tags (<chapter> ... html content ... </chapter>).

DefaultHandler handler = new DefaultHandler() {

    boolean chap = false;

    public char[] temp;
    int chapterStart;
    int chapterEnd;

    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {

        System.out.println("Start Element :" + qName);

        if (qName.equalsIgnoreCase("chapter")) {
            chap = true;
        }

    }

    public void endElement(String uri, String localName,
            String qName) throws SAXException {

        if (qName.equalsIgnoreCase("chapter")) {
            System.out.println(new String(temp, chapterStart, chapterEnd - chapterStart));
        }
        System.out.println("End Element :" + qName);

    }

    public void characters(char ch[], int start, int length)
            throws SAXException {

        if (chap) {
            temp = ch;
            chapterStart = start;
            chap = false;
        }
        chapterEnd = start + length;

    }

};

Update:

My code has a bug: the length of the ch[] array that the handler receives in characters() varies from one situation to another, so the offsets saved above cannot be relied on.



Source: https://stackoverflow.com/questions/4602359/sax-parser-retrieving-html-tags-from-xml
