Similar image search by pHash distance in Elasticsearch

后端 未结 7 1545
悲哀的现实
悲哀的现实 2020-12-07 10:51

Similar image search problem

  • Millions of images pHash\'ed and stored in Elasticsearch.
  • Format is \"11001101...11\" (length 64), but can be changed (
7条回答
  •  心在旅途
    2020-12-07 11:09

    I've started on a solution to this myself. I've only tested so far against a data set of around 3.8million documents, and I intend to push that upwards of tens-of-millions now.

    My solution so far, is this:

    Write a native scoring function and register it as a plugin. Then call this when querying to adjust the _score value of documents as they come back.

    As a groovy script, the time taken to run the custom scoring function was extremely unimpressive, but writing it as a native scoring function (as demonstrated in this somewhat aged blog post: http://www.spacevatican.org/2012/5/12/elasticsearch-native-scripts-for-dummies/) was orders of magnitude faster.

    My HammingDistanceScript looked something like this:

    public class HammingDistanceScript extends AbstractFloatSearchScript {
    
        private String field;
        private String hash;
        private int length;
    
        public HammingDistanceScript(Map params) {
            super();
            field = (String) params.get("param_field");
            hash = (String) params.get("param_hash");
            if(hash != null){
                length = hash.length() * 8;
            }
        }
    
        private int hammingDistance(CharSequence lhs, CharSequence rhs){          
            return length - new BigInteger(lhs, 16).xor(new BigInteger(rhs, 16)).bitCount();
        }
    
        @Override
        public float runAsFloat() {
            String fieldValue = ((ScriptDocValues.Strings) doc().get(field)).getValue();
            //Serious arse covering:
            if(hash == null || fieldValue == null || fieldValue.length() != hash.length()){
                return 0.0f;
            }
    
            return hammingDistance(fieldValue, hash);
        }
    }
    

    It's worth mentioning at this point that my hashes are hex-encoded binary strings. So, the same as yours, but hex-encoded to reduce storage size.

    Also, I'm expecting a param_field parameter, which identifies which field value I want to do hamming distance against. You don't need to do this, but I'm using the same script against multiple fields, so I do :)

    I use it in queries like this:

    curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
      "query": {
        "function_score": {     
          "min_score": MY IDEAL MIN SCORE HERE,
          "query":{
           "match_all":{}
          },
          "functions": [
            {
              "script_score": {
                "script": "hamming_distance",
                "lang" : "native",
                "params": {
                  "param_hash": "HASH TO COMPARE WITH",
                  "param_field":"phash"
                }
              }
            }
          ]
        }
      }
    }'
    

    I hope this helps in some way!

    Other information that may be useful to you if you go this route:

    1. Remember the es-plugin.properties file
    This has to be compiled into the root of your jar file (if you stick it in /src/main/resources then build your jar it'll go in the right place).

    Mine looked like this:

    plugin=com.example.elasticsearch.plugins.HammingDistancePlugin
    name=hamming_distance
    version=0.1.0
    jvm=true
    classname=com.example.elasticsearch.plugins.HammingDistancePlugin
    java.version=1.7
    elasticsearch.version=1.7.3
    

    2. Reference your custom NativeScriptFactory impl in elasticsearch.yml
    Just like on aged blog post.

    Mine looked like this:

    script.native:
        hamming_distance.type: com.example.elasticsearch.plugins.HammingDistanceScriptFactory
    

    If you don't do this, it still shows up on the plugins list (see later) but you'll get errors when you try to use it saying that elasticsearch can't find it.

    3. Don't bother using the elasticsearch plugin script to install it
    It's just a pain the ass and all it seems to do is unpack your stuff - a bit pointless. Instead, just stick it in %ELASTICSEARCH_HOME%/plugins/hamming_distance and restart elasticsearch.

    If all has gone well, you'll see it being loaded on elasticsearch startup:

    [2016-02-09 12:02:43,765][INFO ][plugins                  ] [Junta] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, hamming_distance, euclidean_distance, cloud-aws], sites [marvel, bigdesk]
    

    AND when you call the list of plugins it'll be there:

    curl http://localhost:9200/_cat/plugins?v
    

    produces something like:

    name        component                version type url
    Junta       hamming_distance         0.1.0   j
    

    I'm expecting to be able to test against upwards of tens-of-millions of documents within the next week or so. I'll try and remember to pop back and update this with the results, if it helps.

提交回复
热议问题