Get all text from an XML document?

闹比i 2020-12-11 07:17

How can I get all the text content of an XML document, as a single string - like this Ruby/hpricot example but using Python.

I'd like to replace XML tags with a sin

5 answers
  •  春和景丽
    2020-12-11 07:37

    This very problem is actually an example in the lxml tutorial, which suggests using one of the following XPath expressions to get the text content of the document:

    • root.xpath("string()"), which returns all the text as one concatenated string
    • root.xpath("//text()"), which returns the bits of text as a list of strings

    With the list returned by //text(), you'll then want to join the bits of text into a single big string with str.join, probably using str.strip to remove leading and trailing whitespace from each bit and skipping bits that consist entirely of whitespace:

    >>> from lxml import etree
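    >>> # The element names (a, b, c, d) are placeholders; only the text layout matters
    >>> root = etree.fromstring("""
    ... <a>
    ...   some text
    ...   <b>   </b>
    ...   <c>
    ...     foo bar
    ...   </c>
    ...   yet more text
    ...   <d/>
    ...   even more text
    ... </a>
    ... """)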
    >>> bits_of_text = root.xpath('//text()')
    >>> print(bits_of_text)  # Note that some bits are whitespace-only
    ['\n  some text\n  ', '   ', '\n  ', '\n    foo bar\n  ', '\n  yet more text\n  ', '\n  even more text\n']
    >>> joined_text = ' '.join(
    ...     bit.strip() for bit in bits_of_text
    ...     if bit.strip() != ''
    ... )
    >>> print(joined_text)
    some text foo bar yet more text even more text
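
    If lxml is not available, the same strip-and-join idea can be sketched with the standard library instead. This is a swapped-in technique, not what the lxml tutorial shows: it assumes xml.etree.ElementTree and its itertext() method, and the helper name all_text and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def all_text(xml_source: str, sep: str = " ") -> str:
    """Return every non-blank text node of the document, joined by sep."""
    root = ET.fromstring(xml_source)
    # itertext() yields each element's text and its children's tails
    # in document order, much like the //text() XPath query.
    bits = (bit.strip() for bit in root.itertext())
    # Skip bits that were whitespace-only before stripping.
    return sep.join(bit for bit in bits if bit)


# Hypothetical sample with placeholder element names.
sample = (
    "<a>\n  some text\n  <b>   </b>\n  <c>\n    foo bar\n  </c>\n"
    "  yet more text\n  <d/>\n  even more text\n</a>"
)
print(all_text(sample))
```

    Passing sep="" gives the run-together form instead of space-separated text.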
    

    Note, by the way, that if you don't want to insert spaces between the bits of text, you can just do:

    etree.tostring(root, method='text', encoding='unicode')
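
    To see what "no spaces inserted" means in practice, here is a small sketch. As an assumption made so that it runs without lxml installed, it uses the standard library's xml.etree.ElementTree, whose tostring accepts the same method and encoding arguments; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: text nodes sit directly next to tags.
root = ET.fromstring("<r>one<b/>two<c>three</c></r>")

# method="text" drops the tags and concatenates the text nodes as they are.
run_together = ET.tostring(root, method="text", encoding="unicode")

# Joining the text nodes yourself lets you choose the separator.
spaced = " ".join(root.itertext())

print(repr(run_together))
print(repr(spaced))
```

    Words that were separated only by a tag end up fused in the first result, which is why the join-based approach is preferable when tags should become spaces.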
    

    And if you're dealing with HTML instead of XML, and are using lxml.html to parse your HTML, you can just call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted):

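    >>> import lxml.html
    >>> # Illustrative markup: any HTML containing these text nodes behaves the same
    >>> root = lxml.html.document_fromstring('<p>stuff</p><p>more <b>stuff</b>bla</p>')
    >>> root.text_content()
    'stuffmore stuffbla'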
