The hunt for the fastest Hamming Distance C implementation [duplicate]

问题

I want to find how many different characters two strings of equal length have. I have found that xoring algorithms are considered to be the fastest, but they return distance expressed in bits. I want the results expressed in characters. Suppose that "pet" and "pit" have distance 1 expressed in characters but 'e' and 'i' might have two different bits, so xoring returns 2.

The function i wrote is:

// na = length of both strings
unsigned int HammingDistance(const char* a, unsigned int na, const char* b) {

    unsigned int num_mismatches = 0;
    while (na) {
        if (*a != *b)
            ++num_mismatches;

        --na;
        ++a;
        ++b;
    }

    return num_mismatches;
}

Could it become any faster? Maybe using some lower level commands or implementing a different algorithm?

System: Gcc 4.7.2 on Intel Xeon X5650

Thank you

回答1:

You can make your comparison compare more bytes at a time by doing a bitwise operator on the native integer size.

In your code, you're comparing equality of a byte at a time, but your CPU can compare at least a word in a single cycle, and 8 bytes if it's x86-64. The exact performance capabilities depend on the CPU architecture, of course.

But if you would advance through the two pointers with a stride the size of 8, it could sure be faster in some scenarios. When it has to read from the strings from main memory, the memory load time will actually dominate the performance. But if the strings are in the CPU cache, You might be able to do an XOR, and interpret the results by testing where in the 64bit value the bits are changed.

Counting the buckets that aren't 0 can be done with a variant of the SWAR algorithm starting from 0x33333333 instead of 0x55555555.

The algorithm will be harder to work with, because it will require using uint64_t pointers that have proper memory alignment. You'll need a preamble and postscript that covers the leftover bytes. Maybe you should read the assembly the compiler outputs and see if it's not doing something more clever before you try something more complicated in code.

回答2:

Instead of

if (*a != *b)
    ++num_mismatches;

this would be faster on some architectures (with 8 bit bytes) because it avoids the branch:

int bits = *a ^ *b;
bits |= bits >> 4;
bits |= bits >> 2;
bits |= bits >> 1;
num_mismatches += bits & 1;

回答3:

How about loop unrolling:

while (na >= 8){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  num_mismatches += (a[4] != b[4]);
  num_mismatches += (a[5] != b[5]);
  num_mismatches += (a[6] != b[6]);
  num_mismatches += (a[7] != b[7]);
  a += 8; b += 8; na -= 8;
}
if (na >= 4){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  a += 4; b += 4; na -= 4;
}
if (na >= 2){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  a += 2; b += 2; na -= 2;
}
if (na >= 1){
  num_mismatches += (a[0] != b[0]);
  a += 1; b += 1; na -= 1;
}

Also, if you know there are long stretches of equal characters, you could cast the pointers to long* and compare them 4 at a time, and only if not equal look at the individual characters. This code is based on memset and memcpy being fast. It copies the strings into long arrays to 1) eliminate alignment issues, and 2) pad the strings with zeros out to an integer number of longs. As it compares each pair of longs, if they are not equal, it casts the pointers to char* and counts up the unequal characters. The main loop could also be unrolled, similar to the above.

long la[BIG_ENOUGH];
long lb[BIG_ENOUGH];
memset(la, 0, sizeof(la));
memset(lb, 0, sizeof(lb));
memcpy(la, a, na);
memcpy(lb, b, nb);
int nla = (na + 3) & ~3; // assuming sizeof(long) = 4
long *pa = la, *pb = lb;
while(nla >= 1){
  if (pa[0] != pb[0]){
    num_mismatches += (((char*)pa[0])[0] != ((char*)pb[0])[0])
                    + (((char*)pa[0])[1] != ((char*)pb[0])[1])
                    + (((char*)pa[0])[2] != ((char*)pb[0])[2])
                    + (((char*)pa[0])[3] != ((char*)pb[0])[3])
                    ;
  }
  pa += 1;pb += 1; nla -= 1;
}

回答4:

If the strings are padded with zero to always be 32 bytes and their addresses are 16-aligned, you could do something like this: (code neither tested nor profiled)

movdqa xmm0, [a]
movdqa xmm1, [a + 16]
pcmpeqb xmm0, [b]
pcmpeqb xmm1, [b + 16]
pxor xmm2, xmm2
psadbw xmm0, xmm2
psadbw xmm1, xmm2
pextrw ax, xmm0, 0
pextrw dx, xmm1, 0
add ax, dx
movsx eax, ax
neg eax

But if the strings are usually tiny, it does a lot of unnecessary work and it may not be any faster. It should be faster if the strings are usually (nearly) 32 bytes though.

edit: I wrote this answer before I saw your updated comment - if the strings are usually that tiny, this is probably not very good. A 16-byte version could (maybe) be useful though (run the second iteration conditionally, the branch for that should be well-predicted because it'll be rarely taken). But with such short strings, the normal code is hard to beat.

movdqa xmm0, [a]
pxor xmm1, xmm1
pcmpeqb xmm0, [b]
psadbw xmm0, xmm1
pextrw ax, xmm0, 0
movsx eax, ax
neg eax

来源：https://stackoverflow.com/questions/15987611/the-hunt-for-the-fastest-hamming-distance-c-implementation

标签

performance

assembly

optimization

hamming-distance