How can I grab CData out of BeautifulSoup?

I'm scraping a website whose structure is similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSo

5 answers

广开言路

2020-12-03 13:11
BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:
```
import BeautifulSoup

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
  if isinstance(cd, BeautifulSoup.CData):
    print 'CData contents: %r' % cd
```
In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
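
The subtree idea can be sketched roughly as follows. This is only an illustration and assumes BeautifulSoup 4 (`bs4`) with the built-in `html.parser`; the sample markup and the helper name `cdata_under` are made up for the example, and only the `main-contents` ID comes from the answer above.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical markup: one CDATA block outside the target div, one inside it.
html = '''<div id="sidebar"><![CDATA[ignore me]]></div>
<div id="main-contents"><p>We have <![CDATA[some data here]]> and more.</p></div>'''

def cdata_under(markup, element_id):
    """Return the stripped text of every CData node below the element with this id."""
    soup = BeautifulSoup(markup, "html.parser")
    root = soup.find(id=element_id)
    if root is None:
        return []
    # Search only the subtree rooted at the chosen element.
    return [node.strip() for node in root.find_all(string=True)
            if isinstance(node, CData)]

print(cdata_under(html, "main-contents"))
```

Searching from `root` instead of `soup` keeps CDATA blocks elsewhere on the page out of the result.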

名媛妹妹

2020-12-03 13:12

One thing to be careful about when grabbing CData with BeautifulSoup: do not use the lxml parser.

By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content.

Trying it with html.parser:

```
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
```
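
To check how a given parser treats CDATA on your own machine, something like the sketch below may help. It assumes `bs4` is installed; `count_cdata` and `SAMPLE` are names invented for this example, and lxml is treated as optional, so the function returns `None` when that parser is missing.

```python
from bs4 import BeautifulSoup, CData, FeatureNotFound

# Small sample document, modelled on the snippet above.
SAMPLE = "<foo><bar><![CDATA[aaaaaaaaaaaaa]]></bar></foo>"

def count_cdata(markup, parser):
    """Count the CData nodes left after parsing; None if the parser is not installed."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    return sum(isinstance(node, CData) for node in soup.find_all(string=True))

# Compare the parsers you have available.
for parser in ("html.parser", "lxml"):
    print(parser, count_cdata(SAMPLE, parser))
```

A count of zero for a parser means the `isinstance(..., CData)` test will never match with it, even though the text itself may still be in the tree.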

傲寒

2020-12-03 13:17

You could try this:

```
from BeautifulSoup import BeautifulSoup

# source.html contains your html above
with open('source.html') as f:
    soup = BeautifulSoup(f.read())
s = soup.findAll('script')   # every <script> element in the page
cdata = s[0].contents[0]     # first child of the first <script>
```

That should give you the contents of cdata.

Update

This may be a little cleaner:

```
from BeautifulSoup import BeautifulSoup
import re

# source.html contains your html above
with open('source.html') as f:
    soup = BeautifulSoup(f.read())
# first text node whose content contains "CDATA"
cdata = soup.find(text=re.compile("CDATA"))
```

Just personal preference, but I like the bottom one a little better.
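
With BeautifulSoup 4 and `html.parser`, the body of a `<script>` element comes back as one raw string, CDATA markers included, so a regular expression is one way to pull the section out. The following is a rough sketch under that assumption; the sample markup, `CDATA_RE`, and `cdata_from_scripts` are hypothetical names for illustration and are not taken from the page in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment with a CDATA section wrapped in a script tag.
html = '''<script type="text/javascript">
//<![CDATA[
var payload = {"id": 42};
//]]>
</script>'''

# Non-greedy match of everything between the CDATA markers, across lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

def cdata_from_scripts(markup):
    """Return the stripped body of each CDATA section found inside <script> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for script in soup.find_all("script"):
        text = script.string or ""  # .string is None for an empty script
        found.extend(body.strip() for body in CDATA_RE.findall(text))
    return found

for body in cdata_from_scripts(html):
    print(body)
```

When the closing marker is commented out as `//]]>`, the trailing `//` stays in the captured text and may need trimming separately.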

-上瘾入骨i

2020-12-03 13:17

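```
import re
from bs4 import BeautifulSoup

# content holds the markup to parse
soup = BeautifulSoup(content)
for x in soup.find_all('item'):
    # Strip the literal <![CDATA[ and ]]> markers, if present
    print(re.sub(r'<!\[CDATA\[|\]\]>', '', x.string or ''))
```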

梦毁少年i

2020-12-03 13:17

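For anyone using BeautifulSoup4, Alex Martelli's solution works, but with these changes:

```
from bs4 import BeautifulSoup, CData

# txt is the markup string from the answer above;
# html.parser keeps CDATA sections as CData nodes
soup = BeautifulSoup(txt, "html.parser")
for cd in soup.findAll(text=True):
  if isinstance(cd, CData):
    print('CData contents: %r' % cd)
```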
