Reading <content:encoded> tags using BeautifulSoup 4

北战南征 posted on 2020-01-16 03:27:11

Question


I'm using BeautifulSoup 4 (bs4) to read an XML RSS feed, and have come across the following entry. I'm trying to read the content enclosed in the <content:encoded><![CDATA[...]]></content:encoded> tag:

<item>
    <title>Foobartitle</title>
    <link>http://www.acme.com/blah/blah.html</link>
    <category><![CDATA[mycategory]]></category>
    <description><![CDATA[The quick brown fox jumps over the lazy dog]]></description>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>

As I understand it, this format is part of the RSS content module and is pretty common.

I'd like to isolate the <content:encoded> tag and then read the CDATA contents. For the avoidance of doubt, the result would be <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>.

I've looked at several Stack Overflow posts, but I haven't been able to figure out how to get the job done, since they are not directly related to my case.

I am using the lxml XML parser with bs4.

Any suggestions? Thanks!


Answer 1:


from bs4 import BeautifulSoup

doc = ...
soup = BeautifulSoup(doc, "xml")  # "xml" directs bs4 to use lxml's XML parser

Interestingly, BeautifulSoup/lxml changes the tags around, most notably from content:encoded to simply encoded.

>>> print(soup)
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Foobartitle</title>
<link>http://www.acme.com/blah/blah.html</link>
<category>mycategory</category>
<description>The quick brown fox jumps over the lazy dog</description>
<encoded>
        &lt;p&gt;&lt;img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /&gt;&lt;/p&gt;
    </encoded>
</item>

From there, it should be enough to iterate over the children of each encoded tag.

for encoded_content in soup.find_all("encoded"):
    for child in encoded_content.children:
        print(child)

That results in <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>. Note that this seems to be an instance of bs4.element.NavigableString, not CData as in the answers you linked.
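
As a cross-check that does not depend on bs4, here is a minimal sketch using the standard library's xml.etree.ElementTree instead (a different parser from the one used above). It assumes the item sits inside a feed that declares the content prefix with the RSS content module namespace, http://purl.org/rss/1.0/modules/content/; the snippet in the question does not show that declaration, so the wrapping rss and channel elements here are an assumption. FEED, NAMESPACES, and read_encoded are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the <item> lives inside a feed whose root declares the
# "content" prefix. The original snippet omits this declaration.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
    <title>Foobartitle</title>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>
</channel>
</rss>"""

# Maps the prefix used in the search path to the namespace URI.
NAMESPACES = {"content": "http://purl.org/rss/1.0/modules/content/"}


def read_encoded(feed_xml):
    """Return the stripped text of every <content:encoded> element."""
    root = ET.fromstring(feed_xml)
    return [
        # The CDATA section arrives as ordinary element text, padded
        # with the surrounding whitespace, so strip it.
        (node.text or "").strip()
        for node in root.iterfind(".//content:encoded", NAMESPACES)
    ]


if __name__ == "__main__":
    for html in read_encoded(FEED):
        print(html)
```

As with bs4, the CDATA wrapper is gone by the time the text is read; only the markup string inside it is returned.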



Source: https://stackoverflow.com/questions/17437883/reading-contentencoded-tags-using-beautifulsoup-4
