Question
Here is my question: given an N-bit unsigned integer, I want to remove one bit every n bits and pack the remaining bits together.
I need to do this very efficiently (I will perform this operation several billion times on supercomputers) in C or C++11. N and n are known at compile time (template parameters). What is the most efficient algorithm for this?
Here is an example:
#include <iostream>
#include <climits>
#include <type_traits>
#include <bitset>

template <unsigned int Modulo,
          typename Type,
          unsigned int Size = sizeof(Type)*CHAR_BIT,
          class = typename std::enable_if<std::is_integral<Type>::value
                                       && std::is_unsigned<Type>::value>::type>
inline Type f(Type x)
{
    // The most inefficient algorithm ever
    std::bitset<Size> bx(x);
    std::bitset<Size> by(0);
    unsigned int j = 0;
    for (unsigned int i = 0; i < Size; ++i) {
        if (i%Modulo) {
            by[j++] = bx[i];
        }
    }
    return by.to_ullong();
}

int main()
{
    std::bitset<64> x = 823934823;
    std::cout<<x<<std::endl;
    std::cout<<(std::bitset<64>(f<2>(x.to_ullong())))<<std::endl;
    return 0;
}
Answer 1:
Semantics first...
Semantically (and conceptually, because you can't actually use iterators here), you are doing a std::copy_if where your input and output ranges are a std::bitset<N> and your predicate is a lambda of the form (using C++14 generic lambda notation)
[](auto elem) { return elem % n != 0; }
This algorithm has O(N) complexity in the number of assignments and number of invocations of your predicate. Because std::bitset<N> doesn't have iterators, you have to check bit by bit. This means that your loop with a handwritten predicate is doing the exact same computation as a std::copy_if over a hypothetical iterable std::bitset<N>.
This means that, as far as asymptotic efficiency is concerned, your algorithm should not be considered inefficient.
...optimization last
So given the conclusion that your algorithm isn't doing anything as bad as quadratic complexity, can its constant factor be optimized? The main source of efficiency of a std::bitset comes from the fact that your hardware can handle many (8, 16, 32 or 64) bits in parallel. If you had access to the implementation, you could write your own copy_if that takes advantage of that parallelism, e.g. by special hardware instructions, lookup tables, or some bit-twiddling algorithm.
For example, this is how the member function count(), as well as the gcc and SGI extensions _Find_first() and _Find_next(), are implemented. The old SGI implementation uses lookup tables of 256 entries to handle bit counting and quasi-iteration over the bits of each 8-bit char. The latest gcc version uses __builtin_popcountll() and __builtin_ctzll() to do population count and bit lookup for each 64-bit word.
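To make the word-at-a-time idea concrete, here is a minimal sketch of quasi-iteration over the set bits of a single 64-bit word using the two builtins named above. The helper name set_bit_indices is hypothetical and the code is not taken from the gcc sources; it assumes a gcc- or clang-compatible compiler that provides these builtins.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of any library): returns the indices of the
// set bits in `word`, lowest index first.
// Assumes gcc/clang, which provide __builtin_popcountll and __builtin_ctzll.
inline std::vector<unsigned int> set_bit_indices(std::uint64_t word)
{
    std::vector<unsigned int> indices;
    // The population count tells us how many indices we will produce.
    indices.reserve(static_cast<std::size_t>(__builtin_popcountll(word)));
    // __builtin_ctzll is undefined for 0, so the loop stops before that.
    while (word != 0) {
        // Count trailing zeros = index of the lowest set bit.
        indices.push_back(static_cast<unsigned int>(__builtin_ctzll(word)));
        // Clear the lowest set bit and continue with the rest.
        word &= word - 1;
    }
    return indices;
}
```

The loop runs once per set bit rather than once per bit position, which is where the gain over a bit-by-bit scan comes from.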
Unfortunately, std::bitset does not expose its underlying array of unsigned integers. So if you want to improve your posted algorithm, you need to write your own BitSet class template (possibly by adapting the source of your own Standard Library) and give it a member function copy_if (or similar) that takes advantage of your hardware. This can give efficiency gains of a factor of 8 to 64 compared to your current algorithm.
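As one illustration of working on the raw integer instead of going bit by bit, the sketch below handles a whole group of Modulo bits per step: it skips the first bit of each group and copies the remaining Modulo - 1 bits with one shift and one mask. This is only one possible bit-twiddling approach, not the BitSet class described above and not necessarily the fastest option; the function name remove_every_nth_bit is hypothetical. It is meant to produce the same result as f in the question, in roughly Size / Modulo iterations instead of Size.

```cpp
#include <climits>
#include <type_traits>

// Hypothetical helper: drops every bit whose index is a multiple of Modulo
// (indices 0, Modulo, 2*Modulo, ...) and packs the remaining bits toward bit 0.
template <unsigned int Modulo,
          typename Type,
          class = typename std::enable_if<std::is_integral<Type>::value
                                       && std::is_unsigned<Type>::value>::type>
inline Type remove_every_nth_bit(Type x)
{
    static_assert(Modulo >= 1, "Modulo must be at least 1");
    constexpr unsigned int Size = sizeof(Type) * CHAR_BIT;
    constexpr unsigned int Keep = Modulo - 1;  // bits kept per group
    // Mask with the low Keep bits set; all ones if a group spans the whole word.
    constexpr Type mask = (Keep >= Size)
        ? static_cast<Type>(~Type(0))
        : static_cast<Type>((Type(1) << Keep) - 1);

    Type result = 0;
    unsigned int out = 0;  // next free bit position in the result
    for (unsigned int in = 0; in < Size; in += Modulo) {
        if (in + 1 >= Size) {
            break;  // only the dropped bit is left; also avoids shifting by Size
        }
        // Skip the bit at index `in`, take up to Keep bits that follow it.
        const Type chunk = static_cast<Type>((x >> (in + 1)) & mask);
        result = static_cast<Type>(result | (chunk << out));
        out += Keep;
    }
    return result;
}
```

Since Modulo is a template parameter, the trip count is known at compile time and the compiler is free to unroll the loop. On x86 hardware with BMI2, a single parallel bit extract instruction with a precomputed mask might be faster still, but that is hardware-specific and not shown here.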
Source: https://stackoverflow.com/questions/21141848/bits-twiddling-hack-most-efficient-way-to-remove-one-bit-every-n-bits