Question
I came up with three solutions so far:
The extremely inefficient standard library pow and log2 functions:
uint_fast16_t powlog(uint_fast16_t n)
{
    return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}
Far more efficient: multiplying up through successive powers of 2 until the next one would exceed the target:
uint_fast16_t multiply(uint_fast16_t n)
{
    uint_fast16_t maxpow = 1;
    while(2*maxpow <= n)
        maxpow *= 2;
    return maxpow;
}
The most efficient so far: binary-searching a precomputed table of powers of 2:
uint_fast16_t binsearch(uint_fast16_t n)
{
    static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(upper_bound(pows.begin(), pows.end(), n)-1);
}
Can this be optimized even more? Any tricks that could be used here?
Full benchmark I used:
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <array>
#include <algorithm>
using namespace std;
using namespace chrono;
uint_fast16_t powlog(uint_fast16_t n)
{
    return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}
uint_fast16_t multiply(uint_fast16_t n)
{
    uint_fast16_t maxpow = 1;
    while(2*maxpow <= n)
        maxpow *= 2;
    return maxpow;
}
uint_fast16_t binsearch(uint_fast16_t n)
{
    static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(upper_bound(pows.begin(), pows.end(), n)-1);
}
high_resolution_clock::duration test(uint_fast16_t(powfunct)(uint_fast16_t))
{
    auto tbegin = high_resolution_clock::now();
    volatile uint_fast16_t sink;
    for(uint_fast8_t i = 0; i < UINT8_MAX; ++i)
        for(uint_fast16_t n = 1; n <= 999999; ++n)
            sink = powfunct(n);
    auto tend = high_resolution_clock::now();
    return tend - tbegin;
}
int main()
{
    cout << "Pow and log took " << duration_cast<milliseconds>(test(powlog)).count() << " milliseconds." << endl;
    cout << "Multiplying by 2 took " << duration_cast<milliseconds>(test(multiply)).count() << " milliseconds." << endl;
    cout << "Binsearching precomputed table of powers took " << duration_cast<milliseconds>(test(binsearch)).count() << " milliseconds." << endl;
}
Compiled with -O2, this gave the following results on my laptop:
Pow and log took 19294 milliseconds.
Multiplying by 2 took 2756 milliseconds.
Binsearching precomputed table of powers took 2278 milliseconds.
Answer 1:
Versions with intrinsics have already been suggested in the comments, so here's a version that does not rely on them:
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x ^ (x >> 1);
}
This works by first "smearing" the highest set bit to the right; then x ^ (x >> 1) keeps only the bits that differ from the bit directly to their left (the msb is considered to have a 0 to its left), which is just the highest set bit, because thanks to the smearing the number has the form 0^n 1^m (in string notation, not numerical exponentiation).
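To make the smearing concrete, here is a small self-contained check; the traced value 300 and the asserts are mine, not from the answer.
// Tracing x = 300 = 0b100101100 through the smearing steps:
//   after x |= x >> 1   x == 0b110111110
//   after x |= x >> 2   x == 0b111111111
//   after x |= x >> 4   x == 0b111111111 (no change once fully smeared)
//   after x |= x >> 8   x == 0b111111111
//   after x |= x >> 16  x == 0b111111111
//   x ^ (x >> 1)        == 0b100000000 == 256
#include <cassert>
#include <cstdint>
static uint32_t highestPowerOfTwoIn(uint32_t x)   // same body as above
{
    x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16;
    return x ^ (x >> 1);
}
int main()
{
    assert(highestPowerOfTwoIn(300) == 256);
    assert(highestPowerOfTwoIn(1) == 1);
    assert(highestPowerOfTwoIn(1024) == 1024);
    assert(highestPowerOfTwoIn(1023) == 512);
}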
Since no one is actually posting it, with intrinsics you could write (GCC, Clang)
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    return 0x80000000 >> __builtin_clz(x);
}
Or (MSVC, probably, not tested)
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    unsigned long index;
    // ignoring return value, assume x != 0
    _BitScanReverse(&index, x);
    return 1u << index;
}
Which, when directly supported by the target hardware, should be better.
Results on coliru, and latency results on coliru (compare with the baseline too, which should be roughly indicative of the overhead). In the latency result, the first version of highestPowerOfTwoIn doesn't look so good anymore (still OK, but it is a long chain of dependent instructions so it's not a big surprise that it widens the gap with the intrinsics version). Which one of these is the most relevant comparison depends on your actual usage.
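As a side note that is not part of the original answer: if C++20 is available, the standard library provides exactly this operation as std::bit_floor in <bit> (the largest power of two not greater than x, or 0 when x == 0), which typically compiles to the same clz/bsr-based code as the intrinsic versions above.
#include <bit>
#include <cstdint>
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    return std::bit_floor(x);   // C++20: largest power of two <= x, 0 for x == 0
}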
If you have some odd hardware with a fast bit-reversal operation (but maybe slow shifts or slow clz), let's call it _rbit, then you can do
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    x = _rbit(x);
    return _rbit(x & -x);
}
This is of course based on the old x & -x trick, which isolates the lowest set bit; surrounded by bit reversals, it isolates the highest set bit instead.
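For illustration only, here is the same idea with a plain software bit reversal standing in for the hypothetical _rbit (my sketch, and obviously not fast on ordinary hardware):
#include <cassert>
#include <cstdint>
static uint32_t rbit(uint32_t x)           // software stand-in for the _rbit instruction
{
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {         // reverse all 32 bits, one per iteration
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    x = rbit(x);                           // highest set bit becomes the lowest set bit
    return rbit(x & -x);                   // isolate it, then reverse it back into place
}
int main()
{
    assert(highestPowerOfTwoIn(300) == 256);
    assert(highestPowerOfTwoIn(1) == 1);
}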
Answer 2:
The lookup table looks like the best option here. Hence, to answer
Can this be optimized even more? Any tricks that could be used here?
Yes we can! Let us beat the standard library binary search!
template <class T>
inline size_t
choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
    return b >= a ? src2 : src1;
}

template <class Container>
inline typename Container::const_iterator
fast_upper_bound(Container const& cont, typename Container::value_type const& value)
{
    auto size = cont.size();
    size_t low = 0;
    while (size > 0) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        auto v = cont[probe];
        size = half;
        low = choose(v, value, low, other_low);
    }
    return begin(cont)+low;
}
Using this implementation of upper_bound gives me a substantial improvement:
g++ -std=c++14 -O2 -Wall -Wno-unused-but-set-variable -Werror main.cpp && ./a.out
Pow and log took 2536 milliseconds.
Multiplying by 2 took 320 milliseconds.
Binsearching precomputed table of powers took 349 milliseconds.
Binsearching (opti) precomputed table of powers took 167 milliseconds.
(live on coliru) Note that I've improved your benchmark to use random values; by doing so I removed the branch prediction bias.
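For completeness, this is how the question's lookup table can be wired to fast_upper_bound; the glue function below (binsearch_opt) is my own sketch and assumes the templates above plus the question's #include <array> are in scope.
uint_fast16_t binsearch_opt(uint_fast16_t n)
{
    static const std::array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(fast_upper_bound(pows, n) - 1);
}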
Now, if you really need to push harder, you can optimize the choose function with x86_64 asm for clang:
template <class T>
inline size_t choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
#if defined(__clang__) && defined(__x86_64)
    size_t res = src1;
    asm("cmpq %1, %2; cmovaeq %4, %0"
        : "=q" (res)
        : "q" (a), "q" (b), "q" (src1), "q" (src2), "0" (res)
        : "cc");
    return res;
#else
    return b >= a ? src2 : src1;
#endif
}
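For the record, the cmpq/cmovaeq pair performs an unsigned comparison of a and b followed by a conditional move, so the b >= a ? src2 : src1 selection is carried out without a branch.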
With output:
clang++ -std=c++14 -O2 -Wall -Wno-unused-variable -Wno-missing-braces -Werror main.cpp && ./a.out
Pow and log took 1408 milliseconds.
Multiplying by 2 took 351 milliseconds.
Binsearching precomputed table of powers took 359 milliseconds.
Binsearching (opti) precomputed table of powers took 153 milliseconds.
(Live on coliru)
Answer 3:
This climbs faster but falls back at the same speed.
unsigned int multiply_quick(unsigned int n)
{
    if (n < 2u) return 1u;
    // start at 2 so the squaring scan below makes progress (1u * 1u would loop forever)
    unsigned int maxpow = 2u;
    if (n > 256u)
    {
        maxpow = 256u * 128u;
        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;
        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    else
    {
        // quicker scan: squares through 2, 4, 16, 256
        while (maxpow < n && maxpow != 256u)
            maxpow *= maxpow;
        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;
        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    return maxpow;
}
Maybe this is better suited to 32-bit variables, using a 65536 constant literal instead of 256.
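A sketch of that 32-bit variant (my own adaptation of the code above, not the author's; the overflow guard on the final doubling is also mine):
#include <cstdint>
uint32_t multiply_quick32(uint32_t n)
{
    if (n < 2u) return 1u;
    uint32_t maxpow = 2u;
    if (n > 65536u)
    {
        maxpow = 0x80000000u;                     // 65536u * 32768u, i.e. 2^31
        while (maxpow > n)                        // fast fixing the overshoot
            maxpow = maxpow >> 2;
        while (maxpow < 0x80000000u && 2u * maxpow <= n)   // undershoot, guarded against overflow
            maxpow *= 2u;
    }
    else
    {
        while (maxpow < n && maxpow != 65536u)    // quicker scan: 2, 4, 16, 256, 65536
            maxpow *= maxpow;
        while (maxpow > n)                        // fast fixing the overshoot
            maxpow = maxpow >> 2;
        while (2u * maxpow <= n)                  // fixing the undershoot
            maxpow *= 2u;
    }
    return maxpow;
}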
Answer 4:
Just set to 0 all bits but the highest set one. This should be very fast and efficient.
Answer 5:
As @Jack already mentioned, you can simply set to 0 all bits except the highest one. Here is a solution:
#include <iostream>
#include <cstdint>

uint16_t bit_solution(uint16_t num)
{
    if (num == 0)
        return 0;
    uint16_t ret = 1;
    while (num >>= 1)
        ret <<= 1;
    return ret;
}

int main()
{
    std::cout << bit_solution(1024) << std::endl; //1024
    std::cout << bit_solution(1025) << std::endl; //1024
    std::cout << bit_solution(1023) << std::endl; //512
    std::cout << bit_solution(1) << std::endl;    //1
    std::cout << bit_solution(0) << std::endl;    //0
}
Answer 6:
Well, it's still a loop (and its loop count depends on the number of set bits since they are reset one by one), so its worst case is likely to be worse than the approaches using block bit manipulations.
But it's cute.
uint_fast16_t bitunsetter(uint_fast16_t n)
{
    while (uint_fast16_t k = n & (n-1))
        n = k;
    return n;
}
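A quick worked example (mine, not from the answer): for n = 300 = 0b100101100, the successive values of n are 296, 288 and then 256, at which point n & (n-1) is 0 and the function returns 256.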
Source: https://stackoverflow.com/questions/42595326/how-to-efficiently-count-the-highest-power-of-2-that-is-less-than-or-equal-to-a