How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example, but using Python?
I'd like to replace XML tags with a sin
This very problem is actually an example in the lxml tutorial, which suggests using one of the following XPath expressions to get the text content of the document. The first returns it as one concatenated string; the second returns all the bits of text as a list of strings:
root.xpath("string()")
root.xpath("//text()")
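To make the difference between the two expressions concrete, here is a minimal sketch. The sample document is made up purely for illustration; only the two xpath() calls come from the answer itself.

```python
from lxml import etree

# Hypothetical sample document, chosen only to show the two return types.
root = etree.fromstring("<a>one<b>two</b>three</a>")

as_string = root.xpath("string()")  # a single concatenated string
as_list = root.xpath("//text()")    # a list of text nodes in document order

print(repr(as_string))  # 'onetwothree'
print(as_list)          # ['one', 'two', 'three']
```

If you need control over what goes between the pieces, the list form is the one to work with.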
You'll then want to join these bits of text together into a single big string with str.join, probably using str.strip to get rid of leading and trailing whitespace on each bit and ignoring bits that are made entirely of whitespace:
>>> from lxml import etree
>>> root = etree.fromstring("""
... <node>
...  some text
...  <inner_node> </inner_node>
...  <inner_node>
...  foo bar
...  </inner_node>
...  yet more text
...  <inner_node/>
...  even more text
... </node>
... """)
>>> bits_of_text = root.xpath('//text()')
>>> print(bits_of_text) # Note that some bits are whitespace-only
['\n some text\n ', ' ', '\n ', '\n foo bar\n ', '\n yet more text\n ', '\n even more text\n']
>>> joined_text = ' '.join(
... bit.strip() for bit in bits_of_text
... if bit.strip() != ''
... )
>>> print(joined_text)
some text foo bar yet more text even more text
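If you do this in more than one place, the strip-and-join step can be wrapped in a small helper. The sketch below is one possible way to do it, not the tutorial's code: the name collect_text and its sep parameter are hypothetical, and it swaps the XPath query for the itertext() method, which yields the same text and tail fragments for ordinary element content. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml elements provide itertext() as well.

```python
import xml.etree.ElementTree as ET  # stdlib; lxml.etree elements also have itertext()


def collect_text(root, sep=" "):
    """Join every non-blank text fragment under root, stripped, using sep.

    collect_text and sep are illustrative names, not part of any library.
    """
    stripped = (fragment.strip() for fragment in root.itertext())
    return sep.join(fragment for fragment in stripped if fragment)


# Hypothetical sample document with a whitespace-only fragment after <b>.
doc = ET.fromstring("<a> one <b>two</b>\n  <c/> three </a>")
print(collect_text(doc))  # one two three
```

Passing sep="" gives the text with nothing inserted between the fragments.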
Note, by the way, that if you don't want to insert spaces between the bits of text, you can just do:
etree.tostring(root, method='text', encoding='unicode')
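A short sketch of what that call returns, again on a made-up sample document: the text of every element is concatenated as-is, with no separator and no stripping.

```python
from lxml import etree

# Hypothetical sample document for illustration.
root = etree.fromstring("<a>one<b>two</b>three</a>")

# encoding="unicode" makes tostring return a str rather than bytes.
text = etree.tostring(root, method="text", encoding="unicode")
print(text)  # onetwothree
```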
And if you're dealing with HTML instead of XML, and are using lxml.html to parse your HTML, you can just call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted):
>>> import lxml.html
>>> root = lxml.html.document_fromstring('<div>stuff<p>more stuff</p>bla</div>')
>>> root.text_content()
'stuffmore stuffbla'