How to get inner content as string using minidom from xml.dom?

Posted by ≯℡__Kan透↙ on 2019-12-24 09:39:21

Question


I have some text tags in my XML file (a PDF converted to XML using pdftohtml from poppler-utils) that look like this:

<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>

and I can get the text enclosed in a text tag with this sample code:

from xml.dom import minidom

xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')

# node_index is the position of the <text> element I want to read
some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue

# but not if <i></i> wraps only some words of the string

but I cannot get "nodeValue" if the tag contains another tag (<i> or <b>...) inside it, and I cannot get the child object either.
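
To make the failure concrete, here is a minimal sketch of how minidom splits mixed content into sibling nodes. The one-line XML string is a shortened stand-in for the real file, and the variable names are only illustrative.

```python
from xml.dom import minidom

# Shortened stand-in for one <text> element with mixed content (assumption).
doc = minidom.parseString(
    "<text>with some <i>italic words</i> and more text</text>"
)
text_el = doc.documentElement

# The element has three children: a text node, the <i> element, a text node.
children = [(c.nodeType, c.nodeName) for c in text_el.childNodes]

# firstChild is only the leading text node, so its nodeValue is a fragment.
first_value = text_el.firstChild.nodeValue

# An element node such as <i> has no nodeValue of its own.
italic_value = text_el.childNodes[1].nodeValue

print(children)
print(repr(first_value))
print(italic_value)
```

So `firstChild.nodeValue` can only ever return the text before the first nested tag, which is why the children have to be walked.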

What is the best way to get all of the text as a plain string, like JavaScript's innerHTML, or to recurse into child tags even when they wrap only some words rather than the entire nodeValue?

thanks


Answer 1:


Question: How to get inner content as string using minidom

A recursive solution, for instance:

from xml.dom import minidom


def getText(nodelist):
    # Collect the data of every TEXT_NODE, descending into child elements
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Not a text node (e.g. <i>): recurse into its children
            rc.append(getText(node.childNodes))
    return ''.join(rc)


xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')

# Iterate over the <text ...>...</text> node list
for node in nodelist:
    print(getText(node.childNodes))

Output:

..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...

Tested with Python: 3.4.2
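
If the nested tags should be kept, which is closer to what innerHTML returns, one possible approach is to serialize each child with `toxml()` and join the results. This is only a sketch: `inner_xml` is a hypothetical helper name, and the `<page>` wrapper and inline sample string are assumptions standing in for the real file.

```python
from xml.dom import minidom

# Inline sample standing in for '../test/text.xml' (assumption).
SAMPLE = (
    '<page>'
    '<text top="546" left="128" width="645" height="16" font="1">'
    "with many many pages and some <i>italics text among 'plain' text</i>"
    ' and more and more text</text>'
    '</page>'
)


def inner_xml(element):
    """Serialize the children of `element`, keeping nested tags such as <i>."""
    return ''.join(child.toxml() for child in element.childNodes)


doc = minidom.parseString(SAMPLE)
text_el = doc.getElementsByTagName('text')[0]
markup = inner_xml(text_el)
print(markup)
```

Note that `toxml()` re-escapes special characters such as `&` and `<` in text nodes, so the result is XML markup rather than the decoded text.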



Source: https://stackoverflow.com/questions/45603446/how-to-get-inner-content-as-string-using-minidom-from-xml-dom