python xml.etree.ElementTree remove empty tag in the middle of text

Submitted by 你。 on 2021-01-29 14:49:49

Question


I have an xml document from which I want to extract text based on tags.
The part that I want to extract text from looks something like this :

<BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do

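import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one that contains "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text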

I'm only able to grab the part that comes before the empty tag <TIP CONTENT="­"/>
I tried to delete this tag before getting the rest of the text:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this is not working.
Neither <BlockText> nor <TIP> is a direct child of root.


Thank you.


Answer 1:


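The text after the empty <TIP> element belongs to that element's tail, not to the text of the BlockText element.

elem.text is the text between an element's start tag and its first child (or its end tag if it has no children). elem.tail is the text that follows the element's end tag, up to the next tag. The tail is usually just whitespace, but in this case it holds actual text.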
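
To make the text/tail split concrete, here is a minimal sketch on a shortened stand-in for the fragment in the question. The sample string and the variable names are assumptions for illustration only, and the TIP attribute value is replaced by a plain hyphen.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the BlockText fragment in the question.
xml_src = '<BlockText lang="en">survived not only<TIP CONTENT="-"/> five centuries</BlockText>'

block = ET.fromstring(xml_src)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText> and <TIP/>
print(repr(tip.text))    # the empty element has no text of its own
print(repr(tip.tail))    # text between <TIP/> and </BlockText>

# Rebuild the sentence by hand: the parent's text, then each child's tail.
# This is enough here only because the children are empty elements.
full = (block.text or "") + "".join(child.tail or "" for child in block)
print(full)
```

The `or ""` guards are there because text and tail are None when nothing follows the tag.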




Answer 2:


OK, this is what ended up working for me:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # The text that follows each <TIP/> is stored in that element's tail
    for element in root.iter(emptyTag):
        print(element.tail)

But I still can't get the text as a whole block (same order). I can get all the BlockText tags and all the TIP tags but not together.

Update:
I used itertext(), which yields the text and tail of every nested element in document order:

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
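
If the TIP elements really have to be removed from the tree, as the question first attempted, one possible approach is sketched below. Element.remove() only works on the element's direct parent, and removing an element also discards its tail, so the tail has to be moved onto the preceding sibling or the parent first. The helper name strip_elements and the sample document are hypothetical, and the sketch assumes an exact, un-namespaced tag name.

```python
import xml.etree.ElementTree as ET

def strip_elements(root, tag):
    """Remove every `tag` element below root, keeping its tail text.

    Hypothetical helper: the removed element's own text and children are dropped.
    """
    # Materialize the traversal first so the tree is not mutated while iterating.
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if prev is None:
                    # No kept sibling before it: the tail joins the parent's text.
                    parent.text = (parent.text or "") + tail
                else:
                    # Otherwise it joins the previous sibling's tail.
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child

# Hypothetical document in which BlockText is not a direct child of the root.
xml_src = (
    '<Doc><Page><BlockText ID="Bhf76">survived not only'
    '<TIP CONTENT="-"/> five centuries</BlockText></Page></Doc>'
)
root = ET.fromstring(xml_src)
strip_elements(root, "TIP")

block = root.find(".//BlockText")
print(block.text)
```

After the call, block.text alone holds the whole sentence. For read-only extraction the itertext() version above is simpler and leaves the tree untouched.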



Answer 3:


Another solution, for reference only, using the third-party simplified_scrapy package:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']


Source: https://stackoverflow.com/questions/60321983/python-xml-etree-elementtree-remove-empty-tag-in-the-middle-of-text
