Parsing lxml.etree._Element contents

筅森魡賤 posted on 2019-12-05 17:55:26

I know there must be a better way, but this works:

link = td_elem.find('a').text.strip()
text = ''.join(td_elem.itertext()).strip()
text.split(link)[1]

The output is: Power La Vaca(M8025K)Linux 4.2.x.x
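For anyone who wants to run this end to end, here is a minimal sketch. The markup in `td_html` is an assumption reconstructed from the output shown in this thread, not the asker's real HTML, and `text_after_link` is a hypothetical helper name. The import falls back to the standard library's `xml.etree.ElementTree`, which supports the same `find` and `itertext` calls, in case lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; find() and itertext() behave the same here
    import xml.etree.ElementTree as ET

# Assumed sample markup, reconstructed from the output quoted in the thread.
td_html = (
    '<td>\n  <a href="#">5548U</a>Power La Vaca<br/>'
    '(M8025K)<br/>Linux 4.2.x.x<br/>\n</td>'
)

def text_after_link(td_elem):
    """Return the cell text that follows the text of the first <a> child."""
    link = td_elem.find('a').text.strip()
    text = ''.join(td_elem.itertext()).strip()
    # maxsplit=1 keeps the remainder intact even if the link text
    # happens to appear again later in the cell.
    return text.split(link, 1)[1]

td_elem = ET.fromstring(td_html)
result = text_after_link(td_elem)
print(result)
```

Note that this still raises an exception if the cell has no `<a>` child or the link has no text, so it only suits cells that are known to follow this layout.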

Update: this is actually better if you want spaces in place of those <br> tags:

' '.join(map(str, [el.tail for el in td_elem.iterchildren() if el.tail]))

The map(str, ...) call isn't actually needed here, since every tail that passes the filter is already a string, but I can imagine other values for which it would be.
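Here is a rough, self-contained version of the tail-based approach. Again, `td_html` is assumed sample markup and `tail_text` is a hypothetical helper name. It iterates over the element directly, which yields the same children as `iterchildren()` and also works with the standard library fallback. It additionally strips each tail and drops the ones that are only whitespace, which the one-liner above does not do.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; iterating an element yields its children
    import xml.etree.ElementTree as ET

# Assumed sample markup, reconstructed from the output quoted in the thread.
td_html = (
    '<td>\n  <a href="#">5548U</a>Power La Vaca<br/>'
    '(M8025K)<br/>Linux 4.2.x.x<br/>\n</td>'
)

def tail_text(elem, sep=' '):
    """Join the tail text of each direct child, skipping blank tails."""
    # child.tail is the text that follows a child's closing tag;
    # it is None when nothing follows.
    parts = (child.tail.strip() for child in elem if child.tail)
    return sep.join(part for part in parts if part)

td_elem = ET.fromstring(td_html)
result = tail_text(td_elem)
print(result)
```

With the assumed markup, the tail of the last `<br/>` is just a newline, so filtering blank parts avoids a trailing separator in the result.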

When working with XML, even in Python, I like to use the domain-specific tools that are available. For picking out bits of XML, XPath is it for me.

>>> import lxml.etree as ET  # td_html holds the <td> markup from the question
>>> td_elem = ET.fromstring(td_html)
>>>
>>> # Use XPath to grab just the text nodes under <td/>, 
>>> # ignoring any text nodes in child nodes of <td/> (i.e., <a...>5548U</a>)
>>> print(td_elem.xpath('/td/text()'))
['\n  ', 'Power La Vaca', '(M8025K)', 'Linux 4.2.x.x', '\n']
>>>
>>> # Make it a little cleaner
>>> ' '.join(x.strip() for x in td_elem.xpath('/td/text()'))
' Power La Vaca (M8025K) Linux 4.2.x.x '
>>>
>>> # Just for reference, grab all text nodes with '//'
>>> ' '.join(x.strip() for x in td_elem.xpath('/td//text()'))
' 5548U Power La Vaca (M8025K) Linux 4.2.x.x '
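One caveat worth sketching: lxml evaluates an absolute expression such as '/td/text()' against the root of the document the element belongs to, so it only matches when the <td> is itself the root. If the cell sits inside a larger table, a relative expression is the safer choice. The example below assumes a made-up `table_html` wrapper around the same sample cell, and it also filters out whitespace-only text nodes so the joined string has no leading or trailing space.

```python
from lxml import etree as ET

# Assumed sample markup: the same cell, but nested inside a table.
table_html = (
    '<table><tr><td>\n  <a href="#">5548U</a>Power La Vaca<br/>'
    '(M8025K)<br/>Linux 4.2.x.x<br/>\n</td></tr></table>'
)

table = ET.fromstring(table_html)
td_elem = table.find('.//td')

# Absolute path: evaluated from the document root (<table>), so no match.
absolute = td_elem.xpath('/td/text()')

# Relative path: evaluated from td_elem itself, direct text nodes only.
relative = td_elem.xpath('text()')

# Strip each node and drop the ones that were only whitespace.
cleaned = ' '.join(t.strip() for t in relative if t.strip())

print(absolute)
print(relative)
print(cleaned)
```

Swapping 'text()' for './/text()' would pull in the link text as well, mirroring the '//' example above.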