Reading <content:encoded> tags using BeautifulSoup 4

Question

I'm using BeautifulSoup 4 (bs4) to read an XML RSS feed and have come across the following entry. I'm trying to read the content enclosed in the <content:encoded><![CDATA[...]]></content:encoded> tag:

<item>
    <title>Foobartitle</title>
    <link>http://www.acme.com/blah/blah.html</link>
    <category><![CDATA[mycategory]]></category>
    <description><![CDATA[The quick brown fox jumps over the lazy dog]]></description>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>

As I understand it, this format is part of the RSS content module and is pretty common.

I'd like to isolate the <content:encoded> tag and then read the CDATA contents. For the avoidance of doubt, the result would be <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>.

I've looked at several related Stack Overflow posts, but I haven't been able to figure out how to get the job done, since they are not directly related to my case.

I am using the lxml XML parser with bs4.

Any suggestions? Thanks!

Answer 1:

from bs4 import BeautifulSoup

doc = ...
soup = BeautifulSoup(doc, "xml")  # "xml" directs bs4 to use lxml's XML parser

Interestingly, BeautifulSoup/lxml changes the tags around, most notably from content:encoded to simply encoded.

>>> print(soup)
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Foobartitle</title>
<link>http://www.acme.com/blah/blah.html</link>
<category>mycategory</category>
<description>The quick brown fox jumps over the lazy dog</description>
<encoded>
        &lt;p&gt;&lt;img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /&gt;&lt;/p&gt;
    </encoded>
</item>

From there, it should be enough to iterate over the children of each encoded tag.

for encoded_content in soup.find_all("encoded"):
    # Each child of <encoded> is a text node holding the former CDATA payload
    for child in encoded_content.children:
        print(child)

That results in <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>. Note that this appears to be an instance of bs4.element.NavigableString, not CData as in your linked answers.
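
As a cross-check that does not depend on bs4 or lxml, the same CDATA payload can also be read with the standard library's xml.etree.ElementTree. This is only a sketch and not the bs4 approach shown above. It assumes the feed declares the content prefix with the RSS content module namespace URI (http://purl.org/rss/1.0/modules/content/); the <rss> and <channel> wrapper elements and that declaration are not shown in the question's snippet and are added here for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS content module (assumed to be declared by the feed).
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Illustrative feed: the <rss>/<channel> wrapper and the xmlns:content
# declaration are assumptions; the <item> body follows the question.
doc = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
    <title>Foobartitle</title>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>
</channel>
</rss>"""

root = ET.fromstring(doc)

# ElementTree addresses namespaced tags as "{uri}localname".
# The parser hands CDATA back as ordinary text, so .text holds the raw HTML.
bodies = [(el.text or "").strip() for el in root.iter("{%s}encoded" % CONTENT_NS)]

for body in bodies:
    print(body)
```

The strip() call removes the newlines and indentation that surround the CDATA section inside the element.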

Source: https://stackoverflow.com/questions/17437883/reading-contentencoded-tags-using-beautifulsoup-4

Tags

python

rss

beautifulsoup
