How can I grab CData out of BeautifulSoup?

春和景丽 2020-12-03 12:16

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSo

5 Answers
  • 2020-12-03 13:11

    BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

    import BeautifulSoup
    
    txt = '''<foobar>We have
           <![CDATA[some data here]]>
           and more.
           </foobar>'''
    
    soup = BeautifulSoup.BeautifulSoup(txt)
    for cd in soup.findAll(text=True):
      if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd
    

    In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
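
    A minimal sketch of that subtree search, assuming BeautifulSoup 4 with the built-in html.parser. The markup and the 'sidebar' div are made up for illustration; only the 'main-contents' ID comes from the question.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical markup: only the 'main-contents' id comes from the question.
html = """<html><body>
<div id="sidebar"><p><![CDATA[ignore me]]></p></div>
<div id="main-contents"><p>We have <![CDATA[some data here]]> and more.</p></div>
</body></html>"""

# html.parser keeps CDATA sections as CData objects.
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the subtree rooted at the target div.
main = soup.find("div", id="main-contents")
cdata_blocks = [str(s) for s in main.find_all(string=True) if isinstance(s, CData)]

print(cdata_blocks)
```

    Searching from the div rather than from the whole soup skips CData sections elsewhere in the page, such as the one in the sidebar here.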

  • 2020-12-03 13:12

    When grabbing CData with BeautifulSoup, be careful not to use the lxml parser.

    By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content.

    # Trying it with html.parser
    >>> from bs4 import BeautifulSoup
    >>> import bs4
    >>> s = '''<?xml version="1.0" ?>
    ... <foo>
    ...     <bar><![CDATA[
    ...         aaaaaaaaaaaaa
    ...     ]]></bar>
    ... </foo>'''
    >>> soup = BeautifulSoup(s, "html.parser")
    >>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
    'aaaaaaaaaaaaa'
    
  • 2020-12-03 13:17

    You could try this:

    from BeautifulSoup import BeautifulSoup
    
    # source.html contains your html above
    with open('source.html') as f:
        soup = BeautifulSoup(f.read())
    s = soup.findAll('script')
    cdata = s[0].contents[0]
    

    That should give you the contents of cdata.

    Update

    This may be a little cleaner:

    from BeautifulSoup import BeautifulSoup
    import re
    
    # source.html contains your html above
    with open('source.html') as f:
        soup = BeautifulSoup(f.read())
    cdata = soup.find(text=re.compile("CDATA"))
    

    Just personal preference, but I like the bottom one a little better.
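
    A rough sketch of the regex-based lookup with BeautifulSoup 4, assuming the CDATA block sits inside a script element (the question's markup is not shown here, so that is a guess, and the payload variable is invented). With html.parser, script contents come back as raw text, so the match still includes the CDATA markers and they have to be stripped separately.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment; the script body is invented for illustration.
html = """<div id="main-contents">
<script type="text/javascript">
//<![CDATA[
var payload = {"id": 42};
//]]>
</script>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# First string in the document that mentions CDATA (here, the script body).
raw = soup.find(string=re.compile("CDATA"))

# Capture what lies between the markers, dropping an optional '//' before ']]>'.
match = re.search(r"<!\[CDATA\[(.*?)(?://)?\s*\]\]>", raw, re.DOTALL)
inner = match.group(1).strip() if match else None

print(inner)
```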

  • 2020-12-03 13:17
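    import re
    from bs4 import BeautifulSoup

    # content is assumed to hold the markup to parse
    soup = BeautifulSoup(content)
    for x in soup.find_all('item'):
        # remove the literal CDATA markers, not the individual characters
        print(re.sub(r'<!\[CDATA\[|\]\]>', '', x.string))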
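
    A note on the pattern used to strip the markers: a character class such as '[\[CDATA\]]' matches single characters, so it deletes every '[', ']', 'C', 'D', 'A' and 'T' in the string, including those in the data itself. The sketch below uses only the standard re module and a made-up sample string to compare it with a pattern that matches the literal markers.

```python
import re

# Made-up sample string; real data would come from the parsed document.
sample = "<![CDATA[DATA about CATS]]>"

# Character class: removes each listed character wherever it appears.
char_class_result = re.sub(r"[\[CDATA\]]", "", sample)

# Alternation of the two literal markers: removes only '<![CDATA[' and ']]>'.
marker_result = re.sub(r"<!\[CDATA\[|\]\]>", "", sample)

print(repr(char_class_result))
print(repr(marker_result))
```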
    
  • 2020-12-03 13:17

    For anyone using BeautifulSoup4, Alex Martelli's solution works, but adjust it like this:

    from bs4 import BeautifulSoup, CData
    
    # html.parser keeps CDATA sections as CData objects
    soup = BeautifulSoup(txt, 'html.parser')
    for cd in soup.find_all(text=True):
      if isinstance(cd, CData):
        print('CData contents: %r' % cd)
    