Get all text from an XML document?


闹比i 2020-12-11 07:17

How can I get all the text content of an XML document, as a single string - like this Ruby/hpricot example but using Python.

I'd like to replace XML tags with a sin

5 answers

春和景丽 (OP)

2020-12-11 07:37
This very problem is an example in the lxml tutorial, which suggests two XPath expressions for getting at a document's text. The first returns all of the text as one concatenated string; the second returns the individual bits of text content as a list of strings:
- root.xpath("string()")
- root.xpath("//text()")
With the list returned by //text(), you'll then want to join the bits of text into a single big string with str.join, probably using str.strip to remove leading and trailing whitespace from each bit and skipping bits that consist entirely of whitespace:
```
>>> from lxml import etree
>>> root = etree.fromstring("""
... <node>
...   some text
...   <inner_node>   </inner_node>
...   <inner_node>
...     foo bar
...   </inner_node>
...   yet more text
...   <inner_node />
...   even more text
... </node>
... """)
>>> bits_of_text = root.xpath('//text()')
>>> print(bits_of_text)  # Note that some bits are whitespace-only
['\n  some text\n  ', '   ', '\n  ', '\n    foo bar\n  ', '\n  yet more text\n  ', '\n  even more text\n']
>>> joined_text = ' '.join(
...     bit.strip() for bit in bits_of_text
...     if bit.strip() != ''
... )
>>> print(joined_text)
some text foo bar yet more text even more text
```
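If lxml isn't available, roughly the same strip-and-join idea can be sketched with the standard library's xml.etree.ElementTree, whose itertext() method yields element text and tail text in document order. This swaps in the standard library rather than lxml, and the helper name all_text and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET


def all_text(xml_source: str, sep: str = " ") -> str:
    """Join every non-blank text fragment of an XML document with `sep`.

    `all_text` is a hypothetical helper, not part of lxml or the stdlib.
    """
    root = ET.fromstring(xml_source)
    # itertext() walks the tree and yields each text and tail string in order
    stripped = (bit.strip() for bit in root.itertext())
    # Drop fragments that were whitespace-only before stripping
    return sep.join(bit for bit in stripped if bit)


# Illustrative sample document (element names are arbitrary)
sample = """<node>
  some text
  <inner_node>   </inner_node>
  <inner_node>
    foo bar
  </inner_node>
  yet more text
</node>"""

print(all_text(sample))  # some text foo bar yet more text
```

Passing sep="" gives the concatenated form without inserted spaces, though each fragment is still stripped first.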
If you don't want spaces inserted between the bits of text, you can instead serialize the tree with the text output method:
```
etree.tostring(root, method='text', encoding='unicode')
```
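To make "no spaces inserted" concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the snippet runs without extra installs); its tostring accepts the same method='text' and encoding='unicode' arguments. The sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Illustrative document: note there is no whitespace between </b> and "text"
root = ET.fromstring("<node>some <b>bold</b>text</node>")

# method="text" emits only text and tail content, exactly as it appears
raw_text = ET.tostring(root, method="text", encoding="unicode")
print(repr(raw_text))  # 'some boldtext'
```

Whitespace already present in the document is preserved as-is, so adjacent fragments run together wherever the source had none between them.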
And if you're dealing with HTML instead of XML, and are using lxml.html to parse your HTML, you can just call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted):
```
>>> import lxml.html
>>> root = lxml.html.document_fromstring('<p>stuff<b>more <i>stuff</i></b>bla</p>')
>>> root.text_content()
'stuffmore stuffbla'
```
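If you do want a separator between the bits of text in an HTML document, one possible approach is to collect the text fragments yourself. The following is a minimal sketch using the standard library's html.parser rather than lxml.html; the class name TextCollector and the function html_text are hypothetical, and the sketch makes no attempt to skip the contents of script or style elements:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers non-blank text fragments."""

    def __init__(self):
        super().__init__()
        self.bits = []

    def handle_data(self, data):
        # Called for each run of text between tags
        stripped = data.strip()
        if stripped:
            self.bits.append(stripped)


def html_text(html_source: str, sep: str = " ") -> str:
    """Return all text in `html_source`, joined with `sep`."""
    collector = TextCollector()
    collector.feed(html_source)
    collector.close()  # flush any text still buffered by the parser
    return sep.join(collector.bits)


print(html_text("<p>stuff<b>more <i>stuff</i></b>bla</p>"))
# stuff more stuff bla
```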