Locality-sensitive hashing - Elasticsearch

☆樱花仙子☆ submitted on 2019-12-06 18:39:12

Question


Is there any plugin that enables locality-sensitive hashing (LSH) on Elasticsearch? If so, could you point me to it and briefly explain how to use it? Thanks.

Edit: I found out that ES has a MinHash plugin. How can I compare documents to one another with it? What would be a good setting for finding duplicates?


Answer 1:


  1. There is an Elasticsearch MinHash plugin. You can use it to compute a MinHash value each time you index a document, and then query documents by that value later.

    1. Install the MinHash plugin:

      $ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1
      
    2. Add a minhash analyzer when creating your index:

      $ curl -XPUT 'localhost:9200/my_index' -d '{
        "index":{
          "analysis":{
            "analyzer":{
              "minhash_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter":["minhash"]
              }
            }
          }
        }
      }'  
      
    3. Add a minhash_value field to the index mapping:

      $ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
        "my_type":{
          "properties":{
            "message":{
              "type":"string",
              "copy_to":"minhash_value"
            },
            "minhash_value":{
              "type":"minhash",
              "minhash_analyzer":"minhash_analyzer"
            }
          }
        }
      }'
      
    4. The MinHash value is calculated automatically whenever you add a document to the index you created with the minhash analyzer.
    5. a. Use a More Like This query to run a "like" search on the minhash_value field:

      GET /_search
      {
          "query": {
              "more_like_this" : {
                  "fields" : ["minhash_value"],
                  "like" : "KV5rsUfZpcZdVojpG8mHLA==",
                  "min_term_freq" : 1,
                  "max_query_terms" : 12
              }
          }
      }
      

      b. You can also use a fuzzy query, but it only matches values that differ from the query value by an edit distance of at most 2.

      GET /_search
      {
          "query": {
             "fuzzy" : { "minhash_value" : "KV5rsUfZpcZdVojpG8mHLA==" }
          }
      } 
      

      See the Elasticsearch documentation on the fuzzy query for more details.

  2. Alternatively, you can compute the hash value outside of Elasticsearch: write code that extracts the hash, run it every time you index a document, and attach the resulting value to the document. You can then search on the hash value with a More Like This query or a fuzzy query, as described above.
  3. Last but not least, you can write your own Elasticsearch plugin, like the one above, that implements whichever hashing algorithm suits you, and then follow the same steps.
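
For option 2, here is a minimal Python sketch of what computing a MinHash signature outside of Elasticsearch might look like. The function names (shingles, minhash_signature, estimated_jaccard), the 3-word shingle size, and the 128 hash functions are all assumptions chosen for illustration. This is not the algorithm the plugin uses, and the signatures it produces are not compatible with the plugin's base64 values. The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets, which is one way to compare documents and flag near-duplicates.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word n-grams) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of token, varied by seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text, k)
    if not tokens:
        return [0] * num_hashes
    return [min(_seeded_hash(seed, t) for t in tokens)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "elasticsearch plugins are installed from the command line"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated texts score low
```

The signature (or a compact encoding of it) could then be stored as an extra field on each document at index time. A reasonable duplicate threshold depends on your data, so it is worth testing a few values on known duplicate pairs before settling on one.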


Source: https://stackoverflow.com/questions/32777630/locality-sensitive-hashing-elasticsearch
