I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła\'s extremely clever SSE3 popcount implementation, I coded an AVX2 equivale
In addition to the minor issues in comments (compiling for /arch:AVX
) your primary problem is the generation of the random input arrays on each iteration. This is your bottleneck so your test is not effectively evaluating your methods. Note - I'm not using boost, but GetTickCount
works for this purpose. Consider just :
int count;
count = 0;
{
cout << "AVX PopCount\r\n";
unsigned int Tick = GetTickCount();
for (int i = 0; i < 1000000; i++) {
for (int j = 0; j < 16; j++) {
a[j] = dice();
b[j] = dice();
}
count += AVX_PopCount(a, b);
}
Tick = GetTickCount() - Tick;
cout << Tick << "\r\n";
}
produces output :
AVX PopCount
2309
256002470
So 2309ms to complete... but what happens if we get rid of your AVX routine altogether? Just make the input arrays :
int count;
count = 0;
{
cout << "Just making arrays...\r\n";
unsigned int Tick = GetTickCount();
for (int i = 0; i < 1000000; i++) {
for (int j = 0; j < 16; j++) {
a[j] = dice();
b[j] = dice();
}
}
Tick = GetTickCount() - Tick;
cout << Tick << "\r\n";
}
produces output:
Just making arrays...
2246
How about that. Not surprising, really, since you're generating 32 random numbers, which can be quite expensive, and then performing only some rather fast integer math and shuffles.
So...
Now let's add a factor of 100 more iterations and get the random generator out of the tight loop. Compiling here with optimizations disabled will run your code as expected and will not throw away the "useless" iterations - presumably the code we care about here is already (manually) optimized!
for (int j = 0; j < 16; j++) {
a[j] = dice();
b[j] = dice();
}
int count;
count = 0;
{
cout << "AVX PopCount\r\n";
unsigned int Tick = GetTickCount();
for (int i = 0; i < 100000000; i++) {
count += AVX_PopCount(a, b);
}
Tick = GetTickCount() - Tick;
cout << Tick << "\r\n";
}
cout << count << "\r\n";
count = 0;
{
cout << "SSE PopCount\r\n";
unsigned int Tick = GetTickCount();
for (int i = 0; i < 100000000; i++) {
count += SSE_PopCount(a, b);
}
Tick = GetTickCount() - Tick;
cout << Tick << "\r\n";
}
cout << count << "\r\n";
produces output :
AVX PopCount
3744
730196224
SSE PopCount
5616
730196224
So congratulations - you can pat yourself on the back, your AVX routine is indeed about a third faster than the SSE routine (tested on Haswell i7 here). The lesson is to be sure that you are actually profiling what you think you are profiling!