How can I grab CData out of BeautifulSoup?

Submitted by 核能气质少年 on 2019-11-26 23:10:33

Question


I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice. Specifically, I want to get at the two different types of data hidden in the CData statement. The first is just text, and I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup, I could parse that.

I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
<head>  
<title>
   Cows and Sheep
  </title>
</head>
<body>
 <div id="main">
  <div id="main-precontents">
   <div id="main-contents" class="main-contents">
    <script type="text/javascript">
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
       <!--ts-->
       get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
       <!--yy-->
       <span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
       <!--?5695:5:40:45-->
       ';
        //]]>
      </script>
     </div>
     </div>
    </div>
 </body>
</html>

Answer 1:


BeautifulSoup sees CData as a special case (subclass) of "navigable strings". For example (BeautifulSoup 3, Python 2 syntax):

import BeautifulSoup

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
  if isinstance(cd, BeautifulSoup.CData):
    print 'CData contents: %r' % cd

In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
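
As a rough sketch of that idea with BeautifulSoup 4 and html.parser: the markup below is a shortened, hypothetical stand-in for the page in the question, and the variable names are only illustrative. One caveat is that, at least with html.parser, the body of a script element comes back as one ordinary string, so the CDATA markers are part of the text and have to be stripped by hand rather than detected with isinstance.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the page in the question.
html = """
<div id="main-contents" class="main-contents">
  <script type="text/javascript">
    //<![CDATA[
    var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99};
    //]]>
  </script>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search only the subtree under the div, not the whole document.
div = soup.find("div", id="main-contents")
script = div.find("script")

# The script body is a single string that still contains the CDATA markers.
raw = script.string

# Keep what sits between "<![CDATA[" and the closing "//]]>".
match = re.search(r"<!\[CDATA\[(.*?)//\s*\]\]>", raw, re.DOTALL)
cdata_text = match.group(1).strip() if match else raw.strip()
print(cdata_text)
```

If the page held several script elements, div.find_all("script") could be looped over in the same way.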




Answer 2:


You could try this:

from BeautifulSoup import BeautifulSoup

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]

That should give you the contents of cdata.

Update

This may be a little cleaner:

from BeautifulSoup import BeautifulSoup
import re

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))

Just personal preference, but I like the bottom one a little better.
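
Once the script text is in hand, the two kinds of data mentioned in the question could be separated roughly as follows. This is only a sketch using BeautifulSoup 4 and html.parser: the cdata string is a shortened, hypothetical version of the script text, and the regular expressions assume single-quoted values with no embedded quotes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the script text from the question.
cdata = (
    "var _ = g_cow;"
    "_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};"
    "_[39654].cowmeat_enus = '<table><tr><td><b class=\"q4\">cows rule!</b>"
    "<span>244 Cows</span></td></tr></table>';"
)

# Plain key/value data: a regex is enough.
name = re.search(r"cowname_enus:'([^']*)'", cdata).group(1)
quality = int(re.search(r"leather_quality:(\d+)", cdata).group(1))

# Embedded HTML: pull out the quoted string and give it its own soup.
fragment = re.search(r"cowmeat_enus\s*=\s*'(.*?)';", cdata, re.DOTALL).group(1)
inner = BeautifulSoup(fragment, "html.parser")

print(name, quality, inner.find("span").get_text())
```

The second soup can then be searched like any other document, for example with inner.find_all("td").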




Answer 3:


When grabbing CData with BeautifulSoup, be careful not to use the lxml parser.

By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content.

Trying it with html.parser:


>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>> 



Answer 4:


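import re
from bs4 import BeautifulSoup

# content is assumed to hold the markup to parse
soup = BeautifulSoup(content, 'html.parser')
for x in soup.find_all('item'):
    # Remove literal CDATA markers if any are left in the text,
    # without touching other brackets or letters in the data
    print(re.sub(r'<!\[CDATA\[|\]\]>', '', x.get_text()))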



Answer 5:


For anyone using BeautifulSoup4, Alex Martelli's solution (the first answer above) works, but import from bs4 instead:

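from bs4 import BeautifulSoup, CData

# txt is the sample markup from the first answer
# html.parser keeps CDATA sections as CData objects
soup = BeautifulSoup(txt, 'html.parser')
for cd in soup.find_all(string=True):
  if isinstance(cd, CData):
    print('CData contents: %r' % cd)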


Source: https://stackoverflow.com/questions/2032172/how-can-i-grab-cdata-out-of-beautifulsoup
