How to set bits of a bit vector efficiently in parallel?
Consider a bit vector of N bits (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The array is not sorted. The bit vector is just an array of integers, specifically __m256i, with 256 bits packed into each __m256i element. How can this work be split efficiently across multiple threads? The preferred language is C++ (MSVC++2017, toolset v141); assembly is also fine. The preferred CPU is x86_64 (intrinsics are OK). AVX2 is desirable if it offers any benefit. Let's assume you