Issue on parsing Html with jsoup

血红的双手。 Submitted on 2019-12-11 23:58:30

Question


I am trying to parse this HTML using jsoup.

My code is:

    doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();

    Elements items = doc.select("item");
    Log.d(TAG, "Items size : " + items.size());
    for (Element item : items) {
        Log.d(TAG, "in for loop of items");

        Element titleElement = item.select("title").first();
        mTitle = titleElement.text();
        Log.d(TAG, "title is : " + mTitle);

        Element linkElement = item.select("link").first();
        mLink = linkElement.text();
        Log.d(TAG, "link is : " + mLink);

        Element descElement = item.select("description").first();
        mDesc = descElement.text();
        Log.d(TAG, "description is : " + mDesc);
    }

I am getting the following output:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : 
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img></a> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>

But I want the output to be:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3  
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.

What should I change in my code to get this output? Please help me!

Thank you in advance!!


Answer 1:


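There are two problems with the RSS content you fetched.

  1. The link text is not inside the <link/> tag but outside of it. jsoup's default HTML parser treats <link> as a void (self-closing) element, so the URL ends up as a text node of the enclosing <item>.
  2. The description tag contains escaped HTML content.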
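
To see why both problems come from the parser rather than from the feed itself, here is a minimal sketch that parses a feed item as XML instead of HTML. It deliberately uses the JDK's built-in XML parser rather than jsoup, so it runs without extra libraries (jsoup also offers an XML mode through Parser.xmlParser()). The class name, the helper firstItemField and the shortened SAMPLE feed are all hypothetical, made up for illustration and only modelled on the item in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RssItemSketch {

    // Hypothetical, shortened feed modelled on the item in the question.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio/1.mp3</link>"
        + "<description>Plain summary.&lt;div class=\"feedflare\"&gt;"
        + "&lt;a href=\"http://example.com/x\"&gt;x&lt;/a&gt;&lt;/div&gt;</description>"
        + "</item></channel></rss>";

    // Parses xml with the JDK XML parser and returns the text content of the
    // first <tag> element found inside the first <item> element.
    static String firstItemField(String xml, String tag) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element item = (Element) doc.getElementsByTagName("item").item(0);
            return item.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title is : " + firstItemField(SAMPLE, "title"));
        // With an XML parser the URL stays inside <link>.
        System.out.println("link is : " + firstItemField(SAMPLE, "link"));
        // The entities are decoded, so this string still contains HTML markup.
        System.out.println("description is : " + firstItemField(SAMPLE, "description"));
    }
}
```

In this sketch the link line carries the URL, because an XML parser does not treat <link> as self-closing. The description comes back as a string that still contains the decoded <div class="feedflare"> markup, so a second HTML parse is still needed to keep only the text.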

Please find the modified code below.

Also, when I viewed the URL in a browser I found much cleaner HTML content, from which the desired fields would be easier to extract. You can get that content by setting a browser user agent (userAgent) on the Jsoup connection. It is up to you to decide how to fetch the content.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    // timeout(0) disables the timeout
    Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
    System.out.println(doc.html());
    System.out.println("================================");
    Elements items = doc.select("item");
    for (Element item : items) {

        Element titleElement = item.select("title").first();
        String mTitle = titleElement.text();
        System.out.println("title is : " + mTitle);

        /*
         * The link in the parsed RSS looks like this:
         *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
         * The URL is not inside the <link> element; it is a text node of <item>,
         * so read the item's own text instead.
         */
        String mLink = item.ownText();
        System.out.println("link is : " + mLink);

        Element descElement = item.select("description").first();
        /*
         * Unescape the HTML content, parse it into a document, and then take only
         * the text, leaving behind all the HTML tags in the content.
         * "/" is a dummy base URI; we don't care about resolving links in the parsed content.
         */
        String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
        System.out.println("description is : " + mDesc);

    }


Source: https://stackoverflow.com/questions/17312544/issue-on-parsing-html-with-jsoup
