Can I get lxml to ignore non-XML content before and after the root tag?

我的梦境 submitted on 2019-12-12 02:55:43

Question


I'm trying to use lxml to process a file that may have some non-XML junk both before and after the XML content. Imagine someone captured a terminal buffer and I have something like this:

user@host: cat /tmp/log.xml
<log>
  <foo>...</foo>
  <bar>..
...
</bar>

</log>

user@host:

If I hand etree.parse the filename, it chokes on the leading content. I can delete the first set of lines until I find a line starting with '<' and hand the rest to etree.parse, but then it chokes on the trailing content. The leading and trailing non-XML junk could be anything. I could insist on just valid XML in the files, but I'm trying to be somewhat tolerant of my input. Any ideas?


Answer 1:


Here's another point in the balance between convenience and correctness:

import re

xml = re.search(r"<(\w+).*</\1>", console_output, flags=re.DOTALL).group()

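It assumes the text contains a single root element written as in the sample above, and it captures everything from that element's opening tag through the last matching closing tag. If nothing matches, re.search returns None and the .group() call raises AttributeError.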
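
As a rough sketch of how that regex could be wired into parsing, the snippet below wraps it in a helper and hands the extracted span to etree.fromstring. The helper name parse_embedded_xml and the sample string are made up for illustration, and the fallback to the standard library's ElementTree is only there so the example runs without lxml installed.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here;
    # it offers the same fromstring() call used below.
    import xml.etree.ElementTree as etree

# Same pattern as in the answer: an opening tag name, anything, then the
# matching closing tag. DOTALL lets "." span newlines.
ROOT_PATTERN = re.compile(r"<(\w+).*</\1>", flags=re.DOTALL)


def parse_embedded_xml(console_output):
    """Hypothetical helper: extract the <tag>...</tag> span and parse it."""
    match = ROOT_PATTERN.search(console_output)
    if match is None:
        raise ValueError("no <tag>...</tag> span found in input")
    return etree.fromstring(match.group())


captured = """user@host: cat /tmp/log.xml
<log>
  <foo>...</foo>
  <bar>..
...
</bar>

</log>

user@host:
"""

root = parse_embedded_xml(captured)
print(root.tag, [child.tag for child in root])
```

Raising ValueError on a failed match is a design choice of this sketch, not part of the original answer; it just makes the "no XML found" case explicit instead of surfacing as an AttributeError.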




Answer 2:


At most you can clean out everything before the first opening angle bracket at the front, and everything after the last closing angle bracket at the end:

data = data[data.find('<'):data.rfind('>') + 1]  # + 1 keeps the final '>' in the slice

but this will fall over easily if there are any opening angle brackets at the start before the actual XML data, and any extra closing angle brackets at the end of the data. This is not uncommon in shell environments.

It'll be much easier on you if you just reject any such inputs instead.
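
One way to combine the two ideas, sketched under assumptions: trim to the outermost angle brackets, try to parse, and reject the input if the parser still complains. The function name trim_and_parse and both sample strings are hypothetical, and the fallback import is only there so the example runs without lxml; both lxml.etree and the standard library expose a ParseError exception class.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree


def trim_and_parse(data):
    """Hypothetical helper: trim to the outermost '<' ... '>' and parse.

    Raises ValueError when the trimmed text is not well-formed XML, which
    amounts to rejecting the input rather than guessing further.
    """
    start = data.find('<')
    end = data.rfind('>')
    if start == -1 or end < start:
        raise ValueError("input contains no angle-bracketed content")
    try:
        # end + 1 so the slice includes the final '>'
        return etree.fromstring(data[start:end + 1])
    except etree.ParseError as exc:
        raise ValueError("could not isolate well-formed XML: %s" % exc)


# Junk without stray angle brackets is trimmed cleanly.
clean = "user@host: cat /tmp/log.xml\n<log><foo/></log>\nuser@host:"
print(trim_and_parse(clean).tag)

# A prompt ending in '>' leaves trailing junk inside the slice, so the
# parser rejects it.
noisy = "<log><foo/></log>\nuser@host>"
try:
    trim_and_parse(noisy)
except ValueError as exc:
    print("rejected:", exc)
```

The second case is the failure mode described above: the extra closing angle bracket in the shell prompt ends up inside the slice, and the only safe response is to refuse the input.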



Source: https://stackoverflow.com/questions/15208543/can-i-get-lxml-to-ignore-non-xml-content-before-and-after-the-root-tag
