I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so.
I am given a large array uint64_t data[100000]
You can sort the data according to indices[i]. This should take O(N*log2(N)) time, but the sort can be parallelized.
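As a rough illustration of the sorting step, here is a minimal scalar sketch in C++. It is not AVX2 code. The function name sort_by_index, the use of std::vector, and the uint32_t key type are assumptions made for the example, since the type of indices is not given above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: returns a copy of `data` reordered so that the matching
// keys in `indices` are non-decreasing. The key type (uint32_t) is an assumption.
std::vector<std::uint64_t> sort_by_index(const std::vector<std::uint64_t>& data,
                                         const std::vector<std::uint32_t>& indices) {
    assert(data.size() == indices.size());

    // Sort a permutation of positions instead of the data itself,
    // so the keys and the values stay paired.
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return indices[a] < indices[b]; });

    // Gather the values in sorted-key order.
    std::vector<std::uint64_t> sorted(data.size());
    for (std::size_t k = 0; k < order.size(); ++k) {
        sorted[k] = data[order[k]];
    }
    return sorted;
}
```

The stable sort keeps elements with equal keys in their original relative order, which does not matter for XOR but makes the result deterministic. Whether a parallel sort actually beats a single pass over the data would need to be measured.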
Then take the cumulative XOR of the sorted data, which can also be parallelized.
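To make the "can also be parallelized" part concrete, here is a sketch of the cumulative XOR written two ways: a plain sequential scan, and a blocked three-pass scan whose first and last passes work on each block independently. This is portable scalar C++, not AVX2 intrinsics, and the names cum_xor and cum_xor_blocked and the block parameter are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential inclusive scan: out[k] = in[0] ^ in[1] ^ ... ^ in[k].
std::vector<std::uint64_t> cum_xor(const std::vector<std::uint64_t>& in) {
    std::vector<std::uint64_t> out(in.size());
    std::uint64_t acc = 0;
    for (std::size_t k = 0; k < in.size(); ++k) {
        acc ^= in[k];
        out[k] = acc;
    }
    return out;
}

// Blocked variant. Passes 1 and 3 touch each block independently, so they
// could be handed to separate threads; only pass 2 is sequential, and its
// cost is proportional to the number of blocks.
std::vector<std::uint64_t> cum_xor_blocked(const std::vector<std::uint64_t>& in,
                                           std::size_t block) {
    assert(block > 0);
    const std::size_t n = in.size();
    const std::size_t nblocks = (n + block - 1) / block;
    std::vector<std::uint64_t> out(n);
    std::vector<std::uint64_t> carry(nblocks, 0);

    // Pass 1: local scan inside each block; carry[b] holds block b's total.
    for (std::size_t b = 0; b < nblocks; ++b) {
        std::uint64_t acc = 0;
        const std::size_t end = std::min(n, (b + 1) * block);
        for (std::size_t k = b * block; k < end; ++k) {
            acc ^= in[k];
            out[k] = acc;
        }
        carry[b] = acc;
    }

    // Pass 2: exclusive scan over the block totals, so that carry[b]
    // becomes the XOR of every element in the blocks before b.
    std::uint64_t acc = 0;
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::uint64_t total = carry[b];
        carry[b] = acc;
        acc ^= total;
    }

    // Pass 3: fold the carry into each block's local results.
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block);
        for (std::size_t k = b * block; k < end; ++k) {
            out[k] ^= carry[b];
        }
    }
    return out;
}
```

The blocked version works because XOR is associative, so the scan of a block can be finished locally and corrected afterwards with the XOR of everything before it. The loops here run serially; distributing the blocks across threads or SIMD lanes is left out of the sketch.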
Then it is a matter of calculating Out[i] = CumXor(j) ^ Out[i-1];