Get character offsets for elements in jsoup

Submitted by 一世执手 on 2019-12-04 12:16:11

Question


I need to map jsoup elements back to specific character offsets in the source HTML. In other words, if I have HTML that looks like this:

Hello <br/> World

I need to know that "Hello " starts at offset 0 and has a length of 6 characters, that <br/> starts at offset 6 and has a length of 5 characters, and so on.

I could not find a getter in the Element javadoc that returns this information. Can it be retrieved?


Answer 1:


I don't believe Jsoup has this functionality. This question seems closer to lexical analysis than HTML parsing.

I would write a grammar, and then write a lexer against that grammar which would tokenize the HTML, and supply the offsets that you're looking for.

First, parse the document with Jsoup to verify that it is valid HTML.

Then, lexically analyze the document against a grammar. A grammar might look like:

Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}

optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""

optional-closing-tag := "</" {literal} ">" | ""

literal := any string of characters that does not begin with whitespace and does not contain "<"

Store each token that you find in an object that records the token text, the index of its first character, and its length.
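As a rough illustration of the token-with-offsets idea, here is a minimal sketch in plain Java. It is not jsoup API and it does not implement the grammar above. The class name OffsetTokenizer, the Token type, and the regular expression are all hypothetical, chosen for this example only. It assumes a tag is everything from "<" to the next ">" and that anything else is a text run, so comments, script contents, and attribute values containing ">" would be split incorrectly. Unlike the literal rule above, it keeps leading and trailing whitespace inside text runs so that the offsets match the example in the question. Offsets are counted in Java chars (UTF-16 code units).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of jsoup): splits HTML into tags and text runs with offsets. */
final class OffsetTokenizer {

    /** One token: its text, the index of its first character, and its length. */
    static final class Token {
        final String text;
        final int start;
        final int length;

        Token(String text, int start, int length) {
            this.text = text;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" start=" + start + " length=" + length;
        }
    }

    // Simplifying assumption: a tag runs from "<" to the next ">", and a text run
    // is any stretch without "<". A stray "<" that is never closed becomes a
    // one-character token so that every input character belongs to some token.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+|<");

    /** Scans the input left to right and records each token with its offset and length. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end() - m.start()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input from the question.
        for (Token t : tokenize("Hello <br/> World")) {
            System.out.println(t);
        }
    }
}
```

For the input in the question, this yields "Hello " at offset 0 with length 6, <br/> at offset 6 with length 5, and " World" at offset 11 with length 6. Mapping these tokens back to jsoup elements would still require walking the parsed tree in document order, which this sketch does not attempt.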



Source: https://stackoverflow.com/questions/11387458/get-character-offsets-for-elements-in-jsoup
