Similar image search by pHash distance in Elasticsearch

Asked 2020-12-07 10:51

Similar image search problem

  • Millions of images pHash'ed and stored in Elasticsearch.
  • Format is "11001101...11" (length 64), but can be changed (
7 Answers
  • 2020-12-07 11:03

    The recently proposed FENSHSES method from [1] seems to be the state-of-the-art approach for conducting r-neighbor search in Hamming space on Elasticsearch.

    [1] Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.

  • 2020-12-07 11:08

    I have simulated and implemented a possible solution which avoids all expensive "fuzzy" queries. Instead, at index time you take N random samples of M bits out of those 64 bits. I guess this is an example of locality-sensitive hashing. For each document (and when querying), sample number x is always taken from the same bit positions, so hashing is consistent across documents.

    Queries use term filters in the bool query's should clause with a relatively low minimum_should_match threshold. A lower threshold corresponds to higher "fuzziness". Unfortunately, you need to re-index all your images to test this approach.

    I think { "term": { "phash.0": true } } queries didn't perform well because on average each filter matches 50% of documents. With 16 bits / sample each sample matches 2^-16 = 0.0015% of documents.

    I ran my tests with the following settings:

    • 1024 samples / hash (stored to doc fields "0" - "ff")
    • 16 bits / sample (stored to short type, doc_values = true)
    • 4 shards and 1 million hashes / index, about 17.6 GB of storage (could be minimized by not storing _source and samples, only the original binary hash)
    • minimum_should_match = 150 (out of 1024)
    • Benchmarked with 4 million docs (4 indexes)

    You get faster speed and lower disk usage with fewer samples, but documents at Hamming distances of 8 and 9 aren't separated as well (according to my simulations). 1024 seems to be the maximum number of should clauses (which matches Lucene's default boolean-clause limit).

    Tests were run on a single Core i5 3570K with 24 GB RAM, 8 GB of it allocated to ES version 1.7.1. Results from 500 queries (see notes below; these results are too optimistic):

    Mean time: 221.330 ms
    Mean docs: 197
    
    Percentiles:
       1st = 140.51ms
       5th = 150.17ms
      25th = 172.29ms
      50th = 207.92ms
      75th = 233.25ms
      95th = 296.27ms
      99th = 533.88ms
    

    I'll test how this scales to 15 million documents, but it takes 3 hours to generate and store 1 million documents into each index.

    You should test or calculate how low you can set minimum_should_match to get the desired trade-off between missed matches and incorrect matches; this depends on the distribution of your hashes.
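
    As a starting point before measuring on real data, here is my own back-of-envelope estimate (an approximation that assumes sample positions are chosen independently and uniformly, which is not exactly true, so verify it against your actual hash distribution):

    public class SampleMatchEstimate {

        // A sample of bitsPerSample bits avoids all of the hammingDistance
        // flipped bits with probability roughly ((totalBits - k) / totalBits)^M,
        // so the expected number of exactly matching samples is numSamples
        // times that probability.
        public static double expectedMatchingSamples(int totalBits, int bitsPerSample,
                                                     int numSamples, int hammingDistance) {
            double pSampleUnchanged = Math.pow(
                    (totalBits - hammingDistance) / (double) totalBits, bitsPerSample);
            return numSamples * pSampleUnchanged;
        }
    }

    Comparing the estimates for the Hamming distances you want to accept and reject gives a rough range of minimum_should_match values to start testing from.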

    Example query (3 out of 1024 fields shown):

    {
      "bool": {
        "should": [
          {
            "filtered": {
              "filter": {
                "term": {
                  "0": -12094,
                  "_cache": false
                }
              }
            }
          },
          {
            "filtered": {
              "filter": {
                "term": {
                  "_cache": false,
                  "1": -20275
                }
              }
            }
          },
          {
            "filtered": {
              "filter": {
                "term": {
                  "ff": 15724,
                  "_cache": false
                }
              }
            }
          }
        ],
        "minimum_should_match": 150
      }
    }
    

    Edit: When I started doing further benchmarks I noticed that I had generated hashes that were too dissimilar across the different indexes, so searching them resulted in zero matches. Newly generated documents result in about 150 - 250 matches / index / query, which should be more realistic.

    New results are shown in the graph below; I had 4 GB of memory for ES and the remaining 20 GB for the OS. Searching 1 - 3 indexes had good performance (median time 0.1 - 0.2 seconds), but searching more than that resulted in lots of disk IO and queries started taking 9 - 11 seconds! This could be circumvented by taking fewer samples of the hash, but then recall and precision wouldn't be as good; alternatively, you could get a machine with 64 GB of RAM and see how far that takes you.

    Edit 2: I re-generated the data with _source: false and without storing the hash samples (only the raw hash); this reduced storage space by 60% to about 6.7 GB / index (of 1 million docs). This didn't affect query speed on smaller datasets, but when RAM wasn't sufficient and the disk had to be used, queries were about 40% faster.

    Edit 3: I tested fuzzy search with an edit distance of 2 on a set of 30 million documents, and compared that to 256 random samples of the hash giving approximate results. Under these conditions the methods are roughly the same speed, but fuzzy gives exact results and doesn't need the extra disk space. I think the sampling approach is only useful for "very fuzzy" queries, i.e. a Hamming distance greater than 3.

  • 2020-12-07 11:08

    I also implemented the CUDA approach with some good results, even on a laptop GeForce 650M graphics card. Implementation was easy with the Thrust library. I hope the code doesn't have bugs (I didn't test it thoroughly), but they shouldn't affect the benchmark results. At least I called thrust::system::cuda::detail::synchronize() before stopping the high-precision timer.

    typedef unsigned __int32 uint32_t;
    typedef unsigned __int64 uint64_t;
    
    // SWAR popcount for one 32-bit half; hammingDistance() below applies it
    // to both halves of the 64-bit hash.
    // Maybe there is a simple 64-bit solution out there?
    __host__ __device__ inline int hammingWeight(uint32_t v)
    {
        v = v - ((v>>1) & 0x55555555);
        v = (v & 0x33333333) + ((v>>2) & 0x33333333);
    
        return ((v + (v>>4) & 0xF0F0F0F) * 0x1010101) >> 24;
    }
    
    __host__ __device__ inline int hammingDistance(const uint64_t a, const uint64_t b)
    {
        const uint64_t delta = a ^ b;
        return hammingWeight(delta & 0xffffffffULL) + hammingWeight(delta >> 32);
    }
    
    struct HammingDistanceFilter
    {
        const uint64_t _target, _maxDistance;
    
        HammingDistanceFilter(const uint64_t target, const uint64_t maxDistance) :
                _target(target), _maxDistance(maxDistance) {
        }
    
        __host__ __device__ bool operator()(const uint64_t hash) {
            return hammingDistance(_target, hash) <= _maxDistance;
        }
    };
    

    Linear searching was as easy as

    thrust::copy_if(
        hashesGpu.cbegin(), hashesGpu.cend(), matchesGpu.begin(),
        HammingDistanceFilter(target_hash, maxDistance)
    )
    

    Searching was 100% accurate and way faster than my ElasticSearch answer: in 50 milliseconds CUDA could stream through 35 million hashes! I'm sure newer desktop cards are even faster. We also get very low variance and consistent linear growth of search time as we go through more and more data. ElasticSearch hit bad memory problems on larger queries due to the inflated sampling data.

    So here I'm reporting the results of "From these N hashes, find those which are within a Hamming distance of 8 from a single hash H". I ran these 500 times and report percentiles.

    There is some kernel launch overhead, but once the search space exceeds 5 million hashes the search speed is fairly consistent at 700 million hashes / second. Naturally, the upper bound on the number of hashes to be searched is set by the GPU's RAM.

    Update: I re-ran my tests on a GTX 1060 and it scans about 3800 million hashes per second :)

  • 2020-12-07 11:09

    I've started on a solution to this myself. I've only tested so far against a data set of around 3.8 million documents, and I intend to push that upward to tens of millions now.

    My solution so far is this:

    Write a native scoring function and register it as a plugin. Then call this when querying to adjust the _score value of documents as they come back.

    As a Groovy script, the time taken to run the custom scoring function was extremely unimpressive, but writing it as a native scoring function (as demonstrated in this somewhat aged blog post: http://www.spacevatican.org/2012/5/12/elasticsearch-native-scripts-for-dummies/) was orders of magnitude faster.

    My HammingDistanceScript looked something like this:

    import java.math.BigInteger;
    import java.util.Map;
    
    import org.elasticsearch.index.fielddata.ScriptDocValues;
    import org.elasticsearch.script.AbstractFloatSearchScript;
    
    public class HammingDistanceScript extends AbstractFloatSearchScript {
    
        private String field;
        private String hash;
        private int length;
    
        public HammingDistanceScript(Map<String, Object> params) {
            super();
            field = (String) params.get("param_field");
            hash = (String) params.get("param_hash");
            if(hash != null){
                // Hex-encoded, so each character holds 4 bits.
                length = hash.length() * 4;
            }
        }
    
        // Returns the number of matching bits (length minus the true Hamming
        // distance), so a higher score means a more similar image.
        private int hammingDistance(String lhs, String rhs){
            return length - new BigInteger(lhs, 16).xor(new BigInteger(rhs, 16)).bitCount();
        }
    
        @Override
        public float runAsFloat() {
            String fieldValue = ((ScriptDocValues.Strings) doc().get(field)).getValue();
            //Serious arse covering:
            if(hash == null || fieldValue == null || fieldValue.length() != hash.length()){
                return 0.0f;
            }
    
            return hammingDistance(fieldValue, hash);
        }
    }
    

    It's worth mentioning at this point that my hashes are hex-encoded binary strings. So, the same as yours, but hex-encoded to reduce storage size.
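
    For reference, converting the question's 64-character "0"/"1" hash into the hex form this script compares could look like this (a small sketch of my own, not part of the plugin):

    import java.math.BigInteger;

    public class PhashHex {

        // Convert a 64-character "0"/"1" pHash string (the question's format)
        // into the 16-character hex string stored in the indexed field.
        public static String binaryToHex(String binaryHash) {
            String hex = new BigInteger(binaryHash, 2).toString(16);
            while (hex.length() < 16) {
                hex = "0" + hex; // BigInteger drops leading zeros; pad back
            }
            return hex;
        }
    }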

    Also, I'm expecting a param_field parameter, which identifies which field value I want to do hamming distance against. You don't need to do this, but I'm using the same script against multiple fields, so I do :)

    I use it in queries like this:

    curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
      "query": {
        "function_score": {     
          "min_score": MY IDEAL MIN SCORE HERE,
          "query":{
           "match_all":{}
          },
          "functions": [
            {
              "script_score": {
                "script": "hamming_distance",
                "lang" : "native",
                "params": {
                  "param_hash": "HASH TO COMPARE WITH",
                  "param_field":"phash"
                }
              }
            }
          ]
        }
      }
    }'
    

    I hope this helps in some way!

    Other information that may be useful to you if you go this route:

    1. Remember the es-plugin.properties file
    This has to be compiled into the root of your jar file (if you stick it in /src/main/resources and then build your jar, it'll go in the right place).

    Mine looked like this:

    plugin=com.example.elasticsearch.plugins.HammingDistancePlugin
    name=hamming_distance
    version=0.1.0
    jvm=true
    classname=com.example.elasticsearch.plugins.HammingDistancePlugin
    java.version=1.7
    elasticsearch.version=1.7.3
    

    2. Reference your custom NativeScriptFactory impl in elasticsearch.yml
    Just like in the aged blog post.

    Mine looked like this:

    script.native:
        hamming_distance.type: com.example.elasticsearch.plugins.HammingDistanceScriptFactory
    

    If you don't do this, it still shows up in the plugins list (see later), but you'll get errors when you try to use it, saying that elasticsearch can't find it.

    3. Don't bother using the elasticsearch plugin script to install it
    It's just a pain in the ass and all it seems to do is unpack your stuff - a bit pointless. Instead, just stick it in %ELASTICSEARCH_HOME%/plugins/hamming_distance and restart elasticsearch.

    If all has gone well, you'll see it being loaded on elasticsearch startup:

    [2016-02-09 12:02:43,765][INFO ][plugins                  ] [Junta] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, hamming_distance, euclidean_distance, cloud-aws], sites [marvel, bigdesk]
    

    AND when you call the list of plugins it'll be there:

    curl http://localhost:9200/_cat/plugins?v
    

    produces something like:

    name        component                version type url
    Junta       hamming_distance         0.1.0   j
    

    I'm expecting to be able to test against upwards of tens-of-millions of documents within the next week or so. I'll try and remember to pop back and update this with the results, if it helps.

  • 2020-12-07 11:18

    I have used @ndtreviv's answer as a starting point. Here are my notes for ElasticSearch 2.3.3:

    1. es-plugin.properties file is now called plugin-descriptor.properties

    2. You do not reference the NativeScriptFactory in elasticsearch.yml; instead, you create an additional class next to your HammingDistanceScript:


    import org.elasticsearch.common.Nullable;
    import org.elasticsearch.plugins.Plugin;
    import org.elasticsearch.script.ExecutableScript;
    import org.elasticsearch.script.NativeScriptFactory;
    import org.elasticsearch.script.ScriptModule;
    
    import java.util.Map;
    
    public class StringMetricsPlugin extends Plugin {
        @Override
        public String name() {
            return "string-metrics";
        }
    
        @Override
        public  String description() {
            return "";
        }
    
        public void onModule(ScriptModule module) {
            module.registerScript("hamming-distance", HammingDistanceScriptFactory.class);
        }
    
        public static class HammingDistanceScriptFactory implements NativeScriptFactory {
            @Override
            public ExecutableScript newScript(@Nullable Map<String, Object> params) {
                return new HammingDistanceScript(params);
            }
            @Override
            public boolean needsScores() {
                return false;
            }
        }
    }
    
    3. Then reference this class in your plugin-descriptor.properties file:

    plugin=com.example.elasticsearch.plugins.StringMetricsPlugin
    name=string-metrics
    version=0.1.0
    jvm=true
    classname=com.example.elasticsearch.plugins.StringMetricsPlugin
    java.version=1.8
    elasticsearch.version=2.3.3
    
    4. You query by supplying the name you used in the module.registerScript("hamming-distance", HammingDistanceScriptFactory.class); line from step 2.

    Hope this helps the next poor soul that has to deal with the shitty ES docs.

  • 2020-12-07 11:21

    Here's an inelegant but exact (brute force) solution that requires deconstructing your feature hash into individual boolean fields (a sketch of that index-time step follows at the end of this answer) so you can run a query like this:

    "query": {
        "bool": {
          "minimum_should_match": -8,
          "should": [
              { "term": { "phash.0": true } },
              { "term": { "phash.1": false } },
              ...
              { "term": { "phash.63": true } }
            ]
        }
    }
    

    I'm not sure how this will perform vs. fuzzy_like_this, but the reason the FLT implementation is being deprecated is that it has to visit every term in the index to compute edit-distance.

    (whereas here/above, you are leveraging Lucene's underlying inverted-index data structure and optimized set operations, which should work to your advantage given that you probably have fairly sparse features)
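
    Note that "minimum_should_match": -8 means at most 8 of the 64 optional term clauses may miss, i.e. a Hamming distance of 8 or less. A minimal sketch of the index-time deconstruction (my own illustration; the phash field layout is assumed from the query above):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PhashFields {

        // Turn a 64-character "0"/"1" hash into a nested object so that each
        // bit is indexed as its own boolean field (phash.0 .. phash.63) and
        // the bool/should query above can count matching bits.
        public static Map<String, Object> explode(String binaryHash) {
            Map<String, Object> bits = new LinkedHashMap<>();
            for (int i = 0; i < binaryHash.length(); i++) {
                bits.put(Integer.toString(i), binaryHash.charAt(i) == '1');
            }
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("phash", bits);
            return doc;
        }
    }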
