I also implemented a CUDA approach, with good results even on a laptop GeForce 650M graphics card. Implementation was easy with the Thrust library. I haven't tested the code thoroughly, so it may have bugs, but they shouldn't affect the benchmark results. At least I called thrust::system::cuda::detail::synchronize() before stopping the high-precision timer.
#include <stdint.h>  // uint32_t, uint64_t (portable replacement for the MSVC-only __int32/__int64 typedefs)
// Maybe there is a simple 64-bit solution out there?
__host__ __device__ inline int hammingWeight(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return (((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
}
__host__ __device__ inline int hammingDistance(const uint64_t a, const uint64_t b)
{
    const uint64_t delta = a ^ b;
    return hammingWeight(delta & 0xffffffffULL) + hammingWeight(delta >> 32);
}
struct HammingDistanceFilter
{
    const uint64_t _target, _maxDistance;
    HammingDistanceFilter(const uint64_t target, const uint64_t maxDistance) :
        _target(target), _maxDistance(maxDistance) {
    }
    // Predicate for thrust::copy_if: true when hash is within _maxDistance bits of _target.
    __host__ __device__ bool operator()(const uint64_t hash) const {
        return hammingDistance(_target, hash) <= _maxDistance;
    }
};
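On the 64-bit question in the comment above: a minimal sketch, assuming the same bit tricks are simply widened to 64-bit masks. The names hammingWeight64 and hammingDistance64 are made up for this illustration and are not part of the benchmarked code. It is plain C++, so it should also work in a .cu file once the functions are marked `__host__ __device__`. In device-only code, CUDA's `__popcll` intrinsic counts the set bits of a 64-bit value directly.

```cpp
#include <cstdint>

// 64-bit population count using the same SWAR steps as the 32-bit version,
// with every mask widened to 64 bits.
inline int hammingWeight64(std::uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);  // 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
    return static_cast<int>((v * 0x0101010101010101ULL) >> 56);            // add all bytes
}

// Hamming distance of two 64-bit hashes with a single popcount call.
inline int hammingDistance64(std::uint64_t a, std::uint64_t b)
{
    return hammingWeight64(a ^ b);
}
```

This avoids splitting the XOR result into two 32-bit halves. Whether it is actually faster on a given GPU would need to be measured.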
Linear searching was as easy as
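// copy_if returns an iterator one past the last match written to matchesGpu
const auto matchesEnd = thrust::copy_if(
    hashesGpu.cbegin(), hashesGpu.cend(), matchesGpu.begin(),
    HammingDistanceFilter(target_hash, maxDistance)
);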
Searching was 100% accurate and far faster than in my ElasticSearch answer: in 50 milliseconds CUDA could stream through 35 million hashes! I'm sure newer desktop cards are much faster still. Search time also shows very low variance and grows linearly as the amount of data increases. ElasticSearch hit bad memory problems on larger queries due to inflated sampling data.
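One way to check accuracy is to compare the GPU output with a brute-force scan on the CPU. The following is a minimal sketch of such a reference, not the code used for the benchmark. The names popcount64 and referenceMatches are hypothetical. As far as I know, both std::copy_if and thrust::copy_if keep the input order, so the two result lists should be comparable element by element once the GPU matches are copied back to the host.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Portable popcount (Kernighan's loop): each iteration clears the lowest set bit.
inline int popcount64(std::uint64_t v)
{
    int count = 0;
    while (v != 0) {
        v &= v - 1;
        ++count;
    }
    return count;
}

// Hypothetical CPU reference: returns, in input order, every hash whose
// Hamming distance to `target` is at most `maxDistance`.
std::vector<std::uint64_t> referenceMatches(const std::vector<std::uint64_t>& hashes,
                                            std::uint64_t target,
                                            int maxDistance)
{
    std::vector<std::uint64_t> matches;
    std::copy_if(hashes.begin(), hashes.end(), std::back_inserter(matches),
                 [=](std::uint64_t hash) {
                     return popcount64(target ^ hash) <= maxDistance;
                 });
    return matches;
}
```

The popcount here is deliberately a different algorithm from the bit-trick version, so a mistake in one is less likely to be mirrored in the other.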
So here I'm reporting results for "From these N hashes, find those which are within Hamming distance 8 of a single hash H". I ran each search 500 times and reported percentiles.
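A measurement loop of this kind could look roughly like the sketch below. It is not the harness that produced the numbers here: timeRuns and percentile are hypothetical names, and the nearest-rank percentile definition is an assumption. The callable passed in must block until the GPU has finished, for example by calling `cudaDeviceSynchronize()` after the search, otherwise only the kernel launch gets timed.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile, p in (0, 100]. Takes the samples by value
// so that sorting does not disturb the caller's data.
double percentile(std::vector<double> samples, double p)
{
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(samples.size()) / 100.0));
    if (rank < 1) rank = 1;
    if (rank > samples.size()) rank = samples.size();
    return samples[rank - 1];
}

// Calls `search` `runs` times and returns the wall-clock time of each call
// in milliseconds.
template <typename Search>
std::vector<double> timeRuns(Search search, int runs)
{
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        search();  // must not return before the GPU work is complete
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

With 500 runs, `percentile(timeRuns(search, 500), 50.0)` would give the median and `percentile(..., 99.0)` the tail latency.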
There is some kernel launch overhead, but once the search space exceeds 5 million hashes the search speed is fairly consistent at 700 million hashes per second. Naturally, the upper bound on the number of hashes that can be searched is set by the GPU's RAM.
Update: I re-ran my tests on a GTX 1060, and it scans about 3800 million hashes per second :)