I've started on a solution to this myself. I've only tested so far against a data set of around 3.8million documents, and I intend to push that upwards of tens-of-millions now.
My solution so far, is this:
Write a native scoring function and register it as a plugin. Then call this when querying to adjust the _score value of documents as they come back.
As a groovy script, the time taken to run the custom scoring function was extremely unimpressive, but writing it as a native scoring function (as demonstrated in this somewhat aged blog post: http://www.spacevatican.org/2012/5/12/elasticsearch-native-scripts-for-dummies/) was orders of magnitude faster.
My HammingDistanceScript looked something like this:
public class HammingDistanceScript extends AbstractFloatSearchScript {
private String field;
private String hash;
private int length;
public HammingDistanceScript(Map params) {
super();
field = (String) params.get("param_field");
hash = (String) params.get("param_hash");
if(hash != null){
length = hash.length() * 8;
}
}
private int hammingDistance(CharSequence lhs, CharSequence rhs){
return length - new BigInteger(lhs, 16).xor(new BigInteger(rhs, 16)).bitCount();
}
@Override
public float runAsFloat() {
String fieldValue = ((ScriptDocValues.Strings) doc().get(field)).getValue();
//Serious arse covering:
if(hash == null || fieldValue == null || fieldValue.length() != hash.length()){
return 0.0f;
}
return hammingDistance(fieldValue, hash);
}
}
It's worth mentioning at this point that my hashes are hex-encoded binary strings. So, the same as yours, but hex-encoded to reduce storage size.
Also, I'm expecting a param_field parameter, which identifies which field value I want to do hamming distance against. You don't need to do this, but I'm using the same script against multiple fields, so I do :)
I use it in queries like this:
curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
"query": {
"function_score": {
"min_score": MY IDEAL MIN SCORE HERE,
"query":{
"match_all":{}
},
"functions": [
{
"script_score": {
"script": "hamming_distance",
"lang" : "native",
"params": {
"param_hash": "HASH TO COMPARE WITH",
"param_field":"phash"
}
}
}
]
}
}
}'
I hope this helps in some way!
Other information that may be useful to you if you go this route:
1. Remember the es-plugin.properties file
This has to be compiled into the root of your jar file (if you stick it in /src/main/resources then build your jar it'll go in the right place).
Mine looked like this:
plugin=com.example.elasticsearch.plugins.HammingDistancePlugin
name=hamming_distance
version=0.1.0
jvm=true
classname=com.example.elasticsearch.plugins.HammingDistancePlugin
java.version=1.7
elasticsearch.version=1.7.3
2. Reference your custom NativeScriptFactory impl in elasticsearch.yml
Just like on aged blog post.
Mine looked like this:
script.native:
hamming_distance.type: com.example.elasticsearch.plugins.HammingDistanceScriptFactory
If you don't do this, it still shows up on the plugins list (see later) but you'll get errors when you try to use it saying that elasticsearch can't find it.
3. Don't bother using the elasticsearch plugin script to install it
It's just a pain the ass and all it seems to do is unpack your stuff - a bit pointless. Instead, just stick it in %ELASTICSEARCH_HOME%/plugins/hamming_distance
and restart elasticsearch.
If all has gone well, you'll see it being loaded on elasticsearch startup:
[2016-02-09 12:02:43,765][INFO ][plugins ] [Junta] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, hamming_distance, euclidean_distance, cloud-aws], sites [marvel, bigdesk]
AND when you call the list of plugins it'll be there:
curl http://localhost:9200/_cat/plugins?v
produces something like:
name component version type url
Junta hamming_distance 0.1.0 j
I'm expecting to be able to test against upwards of tens-of-millions of documents within the next week or so. I'll try and remember to pop back and update this with the results, if it helps.