How to index HTML content while keeping positions (as XPath, CSS selector, etc.)

╄→гoц情女王★ submitted on 2019-12-13 01:13:59

Question


I want to create a full-text search index for HTML content (to be more specific: EPUB chapters in XHTML format). Like this:

...
<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>
...

The problem is that I need the position of the matched text (for example, as an XPath) returned with the search results, because I have to scroll the reader software to the right place. I need something like the highlight feature, but instead of returning highlighted text, it should return the position where each match would be highlighted. So if I search for "dolor", it returns something like this:

matches:[
...
  {"match":"dolor", "xpath":"//*[@id='lipsum']/p[1]/b"}
...
]

The standard approach that I found everywhere (strip the HTML markup with a filter, then tokenize, and so on) does not apply here, because it discards the position information in the first step.

Any suggestions? Is this even possible with Solr or Elasticsearch? Thanks!


Answer 1:


Your question is about getting XPaths as the result of highlighting for an XHTML document.

I do not know of a working solution in Solr or Elasticsearch. There is something very similar in the eXtensible Text Framework (XTF), which is built on (an old version of) Lucene. In XTF you can get the highlighting as tags in the original XML file, so it should be easy to write an XSLT transformation that generates the corresponding XPaths.
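As a rough illustration of that last step, here is a minimal Python sketch (used instead of XSLT, just to keep it short). It assumes the highlighter has wrapped each match in a marker element; the tag name `hit` and the function `hit_xpaths` are hypothetical and are not XTF's actual output format or API. The input is also assumed to have no XML namespaces, and a marker that wraps other elements is not descended into.

```python
import xml.etree.ElementTree as ET


def hit_xpaths(marked_xml, hit_tag="hit"):
    """Return the positional XPath of the parent of every marker element.

    `hit_tag` is a hypothetical marker name; adjust it to whatever the
    highlighter actually emits.
    """
    root = ET.fromstring(marked_xml)
    results = []

    def walk(elem, path):
        counts = {}  # per-tag sibling counters for positional steps
        for child in elem:
            if child.tag == hit_tag:
                # Report the enclosing element, not the marker itself.
                results.append({"match": "".join(child.itertext()), "xpath": path})
                continue
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return results


marked = (
    '<div id="lipsum"><p>Lorem ipsum <b><hit>dolor</hit></b> sit '
    "<hit>amet</hit>, consectetur adipiscing elit.</p></div>"
)
for hit in hit_xpaths(marked):
    print(hit)
```

Because the markers are skipped when counting siblings, the resulting paths refer to the original document without the highlighting tags.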

In short, the main idea would be to split the EPUB book into overlapping chunks and store the XML structure as special characters in the indexed and stored field. With the highlighting information you can then reconstruct the original XML structure and find your XPaths.
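A related way to keep the structure next to the text can be sketched in plain Python. This is a variant, not the special-character encoding described above: it keeps a side table that maps character offsets in the stripped text back to the XPath of the enclosing element. The function names `build_offset_map` and `locate` are made up for this example, a naive substring search stands in for the match offsets a search engine would return, and the input is assumed to have no XML namespaces.

```python
import bisect
import xml.etree.ElementTree as ET


def build_offset_map(xhtml):
    """Return (plain_text, starts, paths).

    starts[i] is the offset in plain_text where the i-th text run begins,
    paths[i] is the positional XPath of the element containing that run.
    """
    root = ET.fromstring(xhtml)
    chunks, starts, paths = [], [], []
    pos = 0

    def add(text, path):
        nonlocal pos
        if text:
            starts.append(pos)
            paths.append(path)
            chunks.append(text)
            pos += len(text)

    def walk(elem, path):
        add(elem.text, path)
        counts = {}  # per-tag sibling counters for positional steps
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            add(child.tail, path)  # tail text belongs to the parent element

    walk(root, f"/{root.tag}")
    return "".join(chunks), starts, paths


def locate(term, text, starts, paths):
    """Map every occurrence of term in text to an XPath plus a local offset."""
    hits = []
    i = text.find(term)
    while i != -1:
        run = bisect.bisect_right(starts, i) - 1  # text run containing offset i
        hits.append({"match": term, "xpath": paths[run], "offset": i - starts[run]})
        i = text.find(term, i + 1)
    return hits


chapter = (
    '<div id="lipsum"><p>Lorem ipsum <b>dolor</b> sit amet, '
    "consectetur adipiscing elit.</p></div>"
)
text, starts, paths = build_offset_map(chapter)
print(locate("dolor", text, starts, paths))
```

A match that spans an element boundary is reported at the element where it starts. In a real setup the plain text would go into the index and the offset table would be stored alongside each chunk.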



Source: https://stackoverflow.com/questions/35253292/how-to-index-html-content-keeping-positions-as-xpath-css-selector-etc
