I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.
I'm using BeautifulSo
BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:
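import BeautifulSoup  # BeautifulSoup 3 module layout (Python 2)

# Sample input: text with a CDATA section in the middle (placeholder content).
txt = '''We have
<![CDATA[some data here]]>
and more.
'''

soup = BeautifulSoup.BeautifulSoup(txt)
# Walk every text node and keep only the CData instances.
for cd in soup.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd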
In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
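
If you are on the newer bs4 package (Beautiful Soup 4, Python 3), restricting the search to that div might look roughly like the sketch below. The helper name `cdata_in` and the sample markup are made up for illustration, and I'm assuming the built-in `html.parser` backend here, since other parsers may treat CDATA sections in HTML differently.

```python
from bs4 import BeautifulSoup, CData


def cdata_in(html, element_id):
    """Return the contents of every CDATA section under the element with this id.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    # Assumption: html.parser is used so CDATA sections come back as CData nodes.
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find(id=element_id)
    if root is None:
        return []
    # string=True matches every text node; keep only the CData instances.
    return [str(node) for node in root.find_all(string=True)
            if isinstance(node, CData)]


# Made-up markup: one CDATA section outside the div, one inside it.
sample = (
    '<p><![CDATA[outside]]></p>'
    '<div id="main-contents">We have <![CDATA[some data here]]> and more.</div>'
)
print(cdata_in(sample, "main-contents"))
```

Searching from the div rather than from `soup` means CDATA sections elsewhere on the page are ignored.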