Elasticsearch: shingles with stop words elimination

Posted by 泄露秘密 on 2020-02-02 04:16:53

Question


I am trying to implement an Elasticsearch mapping to optimize phrase search over a large body of text. Following the suggestions in this article, I am using a shingle filter so that each multi-word phrase is indexed as a single token.
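For context, a minimal sketch of the kind of analyzer being described, with the filter name, shingle sizes, and stop-word list chosen purely for illustration, might look like:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "phrase_shingles": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "my_shingles"]
        }
      }
    }
  }
}
```

Because the `stop` filter runs before the `shingle` filter here, removed stop words leave position gaps, and the shingle filter fills those gaps with its filler token (`_` by default), which is exactly the behaviour the two questions below are about.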

Two questions:

  1. In the article mentioned, stop words are filtered out, and the shingle filter marks the resulting gaps by inserting "_" filler tokens. These fillers should be eliminated from the shingles that the engine indexes; the point of this elimination is to be able to answer phrase queries that contain all sorts of "useless" words. The standard solution (as described in the article) is no longer available, because Lucene has deprecated the feature (enable_position_increments on the stop filter) that this behaviour relied on. How do I solve this?

  2. Because punctuation is also eliminated, I routinely see shingles produced by this process that span two separate sentences. From a search standpoint, any result built from words belonging to two different sentences is incorrect. How do I avoid (or at least mitigate) this kind of issue?
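Not an authoritative answer, but one workaround often suggested for the filler-token problem (question 1) is to override the shingle filter's `filler_token` parameter, which defaults to `_`, and then clean up the leftover whitespace with a `pattern_replace` token filter. The filter names and the exact pattern below are illustrative assumptions, not settings taken from the article:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_en": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "shingles_no_filler": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "filler_token": ""
        },
        "trim_gaps": {
          "type": "pattern_replace",
          "pattern": "(^\\s+)|(\\s+$)",
          "replacement": ""
        }
      },
      "analyzer": {
        "phrase_shingles": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop_en", "shingles_no_filler", "trim_gaps"]
        }
      }
    }
  }
}
```

For question 2, one mitigation (again a suggestion, not something from the source) is to index each sentence as a separate value of an array field: Elasticsearch then separates consecutive values by `position_increment_gap` (default 100) positions, so neither shingles built at query time via `match_phrase` with small slop nor position-based phrase matches can bridge two sentences. Verifying the resulting token stream with the `_analyze` API before reindexing is advisable.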

Source: https://stackoverflow.com/questions/22609100/elasticsearch-shingles-with-stop-words-elimination
