Check Elasticsearch document similarity before indexing

Submitted by 痞子三分冷 on 2020-01-25 03:04:07

Question


OK, after pulling my hair out all day trying to figure this one out, I decided to get some input from the community.

I should mention that I'm fairly new to Elasticsearch.

The idea is that I have an ES index containing some documents, and I need to index a new document only if no existing document with similar (but not necessarily equal) field content is already indexed.

I can perform a match query on multiple fields and get a global score for the query, but since that score is not a percentage of the maximum achievable score, I'm not sure how to set a threshold to decide whether I can insert the document.

I am obviously a bit confused about the ES scoring system. Thanks in advance for all the help I can get on this.

EDIT:

As a basic example:

This is already indexed:

{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

This is new but should not be indexed, since its fields are not equal to the existing ones but are too similar:

{
  "title": "My first blog entries",
  "text":  "Just trying it out...",
  "date":  "2014/01/01"
}

This is new and should be indexed:

{
  "title": "My second entry for this blog",
  "text":  "I am just trying out a few things",
  "date":  "2014/01/01"
}

So it's basically deduplication prior to indexing, based on field similarity, that I am after :)


Answer 1:


A perfect solution to your need is the more_like_this query.

In such a query, you can provide artificial documents in the like field, and they will be matched against the documents in your index for similarity. By default all available fields are used, but you can also restrict the comparison to a limited set of fields.
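
As a rough illustration, the request body for such a query could be assembled as below. This is only a sketch: the index name "blog" and the helper build_mlt_query are made up for the example, and lowering min_term_freq and min_doc_freq to 1 is an assumption that tends to matter for very short documents, since the defaults (2 and 5) would otherwise discard most of their terms.

```python
import json


def build_mlt_query(candidate, fields, index_name):
    """Build a more_like_this search body for a document that is not indexed yet.

    `candidate` is passed as an artificial document via the "doc" key of a
    `like` entry; only the listed `fields` are compared.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [
                    {
                        "_index": index_name,  # hypothetical index name
                        "doc": {f: candidate[f] for f in fields},
                    }
                ],
                # Assumption: short texts need these lowered from the defaults.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
        # Only the best match is needed to decide on a duplicate.
        "size": 1,
    }


candidate = {
    "title": "My first blog entries",
    "text": "Just trying it out...",
    "date": "2014/01/01",
}

body = build_mlt_query(candidate, ["title", "text"], "blog")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be sent as the body of a search request against the index; the date field is left out on purpose because it says nothing about textual similarity.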

Most of the time, this query is used to retrieve documents similar to one or a few documents that the user is looking at or has selected. Nonetheless, you can probably use it to inspect the score of the returned documents (if any) and decide whether or not to index your document.
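
The decision step could look something like the sketch below. It assumes the usual search response shape (a hits.hits list whose entries carry a _score), and the function name should_index, the threshold value, and the sample responses are all invented for illustration. Since scores are not normalized, any threshold would have to be tuned by trial on your own data.

```python
def should_index(response, score_threshold):
    """Return True if no sufficiently similar document was found.

    `response` is assumed to be a search response dict; `score_threshold`
    is a hypothetical cut-off that must be tuned per index, because raw
    scores are not percentages and depend on the indexed corpus.
    """
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        # Nothing similar at all, so the document is safe to index.
        return True
    best_score = max(hit["_score"] for hit in hits)
    return best_score < score_threshold


# Hand-written sample responses, not real Elasticsearch output.
similar_found = {"hits": {"hits": [{"_id": "1", "_score": 3.2},
                                   {"_id": "7", "_score": 0.9}]}}
nothing_found = {"hits": {"hits": []}}

print(should_index(similar_found, score_threshold=2.0))  # False: 3.2 >= 2.0
print(should_index(nothing_found, score_threshold=2.0))  # True: no hits
```

One way to pick the threshold is to run the query for a handful of known duplicates and known distinct documents and look at where their scores separate.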

Please refer to the more_like_this query documentation page for a comprehensive list of parameters.



Source: https://stackoverflow.com/questions/35633799/check-elasticsearch-document-similarity-before-indexing
