I have a special question. I will try to describe this as accurately as possible.
I am doing a very important "micro-optimization". A loop that runs for days at a ti
As mentioned by Peter Cordes, you could use SIMD to add multiple values together at a time; see Vector. But it is not clear to me whether this would actually help.
Edit: If you are running .NET Core, there are also SIMD intrinsics that provide lower-level access to the hardware.
As mentioned by NerualHandle, it might be better to use a for loop than a foreach. But when I test it, there does not seem to be a significant difference. I would guess the compiler can optimize foreach in this particular case.
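For reference, the two loop shapes being compared could look like the sketch below. It assumes the loop simply increments one counter per element of an int[] named indexers; the class name, method names, and the bucketCount parameter are hypothetical and not taken from the question. For arrays, the C# compiler lowers foreach to an index-based loop, which would be consistent with seeing no significant difference.

```csharp
using System;

public static class HistogramLoops
{
    // foreach version: iterates the array elements directly.
    // Assumes every value in indexers lies in [0, bucketCount).
    public static int[] WithForeach(int[] indexers, int bucketCount)
    {
        var histogram = new int[bucketCount];
        foreach (int index in indexers)
        {
            histogram[index]++;
        }
        return histogram;
    }

    // for version: explicit index over the same array.
    public static int[] WithFor(int[] indexers, int bucketCount)
    {
        var histogram = new int[bucketCount];
        for (int i = 0; i < indexers.Length; i++)
        {
            histogram[indexers[i]]++;
        }
        return histogram;
    }
}
```

Both methods should return identical counts, so any timing difference comes from the loop construct alone. If indexers were a List<int> rather than an array, the result might differ, so it is worth benchmarking the actual collection type.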
When I run your testbenchmark00 code, it completes in ~6 ms on my computer. Some rough calculations suggest each iteration of the loop takes about 0.78 ns, or about 2-4 processor cycles, which seems to be near optimal. It seems odd that it takes ~20 times longer for you. Are you running in release mode?
You could parallelize the problem: split the indexers array into multiple parts, build a histogram for each part on a separate thread, and sum the per-thread histograms at the end. See Parallel.For, since it can do the partitioning for you, but it requires the use of localInit and localFinally so that each thread writes to its own histogram, which avoids concurrency issues.
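A minimal sketch of the parallel approach might look like this. It assumes indexers is an int[] whose values all lie in [0, bucketCount); the class name ParallelHistogram, the method name Build, and the bucketCount parameter are hypothetical names chosen for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelHistogram
{
    // Counts how often each bucket index occurs in indexers.
    // Assumption: every value in indexers lies in [0, bucketCount).
    public static long[] Build(int[] indexers, int bucketCount)
    {
        var total = new long[bucketCount];
        var gate = new object();

        Parallel.For(
            0, indexers.Length,
            // localInit: each worker task starts with its own private histogram.
            () => new long[bucketCount],
            // body: no locking needed, since local is only touched by this task.
            (i, loopState, local) =>
            {
                local[indexers[i]]++;
                return local;
            },
            // localFinally: merge the private histogram into the shared one.
            // This runs once per task, so the lock is taken only rarely.
            local =>
            {
                lock (gate)
                {
                    for (int b = 0; b < bucketCount; b++)
                    {
                        total[b] += local[b];
                    }
                }
            });

        return total;
    }
}
```

Calling ParallelHistogram.Build(indexers, 4) would then return the summed counts for four buckets. Because the loop body here is a single increment, the per-element delegate call may cost more than the work itself; processing index ranges instead, for example with Partitioner.Create(0, indexers.Length) and Parallel.ForEach, might reduce that overhead, but this should be measured rather than assumed.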
As always with performance optimization, the recommended order to do things is: