Question
Is there a way to get the original location of an element in a document, i.e. the start and end character index, when parsing HTML/XML in Python?
I've looked through the lxml documentation and couldn't find anything.
For example:
<a>1</a><b>2</b>
...
print(tree.find('b').original_position)
# result: (9, 16)
Answer 1:
A Google search turned up a discussion of this problem. The gist: it is hard for malformed documents, because the parser has to synthesize valid tokens that have no corresponding input. It is possible for valid documents, but most parsing libraries don't support it.
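For the well-formed case, one possible approach is to use the standard library's html.parser instead of lxml, since HTMLParser.getpos() reports the line and column where the current tag begins. The following is only a minimal sketch under that assumption: PositionRecorder and element_positions are hypothetical names invented for this example, not part of any library, and the end offset is found by searching for the next ">" after the start of the closing tag.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: record (tag, start, end) for each closed element."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute offset at which each line begins, so that the
        # (line, column) pair from getpos() becomes a single index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.open_tags = []  # stack of (tag, start offset)
        self.positions = []  # (tag, start, end), end is exclusive

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # Only record an element when the closing tag matches the most
        # recently opened one; anything else is treated as malformed
        # input and skipped, since it has no clean source span.
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, start = self.open_tags.pop()
            end = self.source.index(">", self._offset()) + 1
            self.positions.append((tag, start, end))


def element_positions(source):
    parser = PositionRecorder(source)
    parser.feed(source)
    parser.close()
    return parser.positions


print(element_positions("<a>1</a><b>2</b>"))
```

This prints [('a', 0, 8), ('b', 8, 16)]. The offsets count from zero and the end is exclusive, so source[8:16] is exactly "<b>2</b>"; the (9, 16) in the question is the same span counted from one with an inclusive end. Elements that are never closed, or closed out of order, are simply not reported, which is the malformed-document difficulty the answer describes.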
Source: https://stackoverflow.com/questions/8258529/parse-html-xml-and-find-locations-of-elements-in-original-document