How can i grab CData out of BeautifulSoup

倖福魔咒の 提交于 2019-11-27 22:31:57

BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

import BeautifulSoup

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
  if isinstance(cd, BeautifulSoup.CData):
    print 'CData contents: %r' % cd

In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.

You could try this:

from BeautifulSoup import BeautifulSoup

// source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]

That should give you the contents of cdata.

Update

This may be a little cleaner:

from BeautifulSoup import BeautifulSoup
import re

// source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))

Just personal preference, but I like the bottom one a little better.

One thing you need to be careful of BeautifulSoup grabbing CData is not to use a lxml parser.

By default, the lxml parser will strip CDATA sections from the tree and replace them by their plain text content, Learn more here

#Trying it with html.parser


>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>> 
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
for x in soup.find_all('item'):
    print re.sub('[\[CDATA\]]', '', x.string)

For anyone using BeautifulSoup4, Alex Martelli's solution works but do this:

from bs4 import BeautifulSoup, CData

soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
  if isinstance(cd, Cdata):
    print 'CData contents: %r' % cd
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!