Question
I'm trying to use lxml to process a file that may have some non-XML junk both before and after the XML content. Imagine someone captured a terminal buffer, so I have something like this:
user@host: cat /tmp/log.xml
<log>
<foo>...</foo>
<bar>..
...
</bar>
</log>
user@host:
If I hand etree.parse the filename, it chokes on the leading content. I can delete lines from the start until I find one beginning with '<' and hand the rest to etree.parse, but then it chokes on the trailing content. The leading and trailing non-XML junk could be anything. I could insist on only valid XML in the files, but I'm trying to be somewhat tolerant of my input. Any ideas?
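Roughly, that line-stripping workaround looks like the sketch below (not my exact code); it still fails because the trailing prompt after </log> stays in the string:

from lxml import etree

with open("/tmp/log.xml") as f:
    lines = f.readlines()

# drop leading lines until one starts with '<'
while lines and not lines[0].lstrip().startswith("<"):
    lines.pop(0)

try:
    root = etree.fromstring("".join(lines))
except etree.XMLSyntaxError as err:
    print("still chokes on the trailing junk:", err)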
Answer 1:
Here's another point in the balance between convenience and correctness:
import re
# grab from the first opening tag through the last closing tag with the same name; DOTALL lets '.' span newlines
xml = re.search(r"<(\w+).*</\1>", console_output, flags=re.DOTALL).group()
It expects a single root tag given in the above format.
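For example, assuming console_output holds the raw captured text (the variable name is illustrative), the match can be handed straight to lxml:

import re
from lxml import etree

with open("/tmp/log.xml") as f:   # file name taken from the question's example
    console_output = f.read()

match = re.search(r"<(\w+).*</\1>", console_output, flags=re.DOTALL)
if match is None:
    raise ValueError("no XML root element found in the input")

root = etree.fromstring(match.group())
print(root.tag)   # 'log' for the example above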
Answer 2:
At most you can strip everything before the first opening angle bracket at the front and everything after the last closing angle bracket at the end:
data = data[data.find('<'):data.rfind('>') + 1]
but this will easily fall over if there is an opening angle bracket in the junk before the actual XML data, or an extra closing angle bracket in the junk after it, which is not uncommon in shell output.
It'll be much easier on you if you just reject any such inputs instead.
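If you do want to be lenient anyway, here is a minimal sketch of that trimming approach combined with lxml (the helper name parse_lenient is made up for illustration, and data is assumed to be the raw captured text):

from lxml import etree

def parse_lenient(data):
    # Heuristic: keep from the first '<' to the last '>'; this breaks if the
    # surrounding junk itself contains angle brackets (e.g. a 'user@host>' prompt).
    start = data.find('<')
    end = data.rfind('>')
    if start == -1 or end == -1:
        raise ValueError("input does not look like it contains XML")
    return etree.fromstring(data[start:end + 1])

with open("/tmp/log.xml") as f:
    root = parse_lenient(f.read())
print(root.tag)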
Source: https://stackoverflow.com/questions/15208543/can-i-get-lxml-to-ignore-non-xml-content-before-and-after-the-root-tag