Parse HTML/XML and find locations of elements in original document

Question

Is there a way to get the original location of an element in a document, i.e. its start and end character index, when parsing HTML/XML in Python?

I've looked through the lxml documentation and couldn't find anything.

For example:

<a>1</a><b>2</b>

...

print tree.find('b').original_position
# result: (9, 16)

Answer 1:

A Google search turned up a discussion of this, the gist of which is: it's hard for malformed documents, because the parser has to synthesize valid tokens that have no corresponding input. It's possible for valid documents, but most parsing libraries don't support it.
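
For well-formed input, one way to recover offsets is the standard library's html.parser.HTMLParser, whose getpos() method reports the line and column of the construct currently being handled. What follows is only a sketch, not something taken from the discussion mentioned above. The names PositionParser and element_positions are made up for illustration. It assumes the whole document is fed in a single call, that no attribute value contains a literal ">", and it reports 0-based offsets with an exclusive end.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical helper: records (tag, start, end) offsets per element."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute offset of the first character of every line, so that
        # getpos()'s (line, column) pair can be turned into one index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.positions = []  # (tag, start, end), end is exclusive

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # Assumption: the tag being handled ends at the next '>'.
        end = self.source.index(">", self._offset()) + 1
        # Close the nearest open element with the same name, discarding
        # any unclosed elements nested inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                start = self.open_tags[i][1]
                del self.open_tags[i:]
                self.positions.append((tag, start, end))
                break


def element_positions(source):
    parser = PositionParser(source)
    parser.feed(source)
    parser.close()
    return parser.positions


print(element_positions("<a>1</a><b>2</b>"))
# [('a', 0, 8), ('b', 8, 16)]
```

For the example in the question this reports b at (8, 16), which matches the question's (9, 16) if that pair is read as 1-based and inclusive. Elements that are never closed in the input get no entry, which is the malformed-document difficulty the answer describes.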

Source: https://stackoverflow.com/questions/8258529/parse-html-xml-and-find-locations-of-elements-in-original-document

Tags

python

xml-parsing

html-parsing

lxml
