I've been doing some experimentation lately on using large numbers of random numbers to generate "normal distribution" bell curves.
The approach is simple:
<
Ok, thanks to Michael and Noel for your thoughtful responses.
Indeed it seems that arc4random() and arc4random_uniform() use a variant of a spin_lock, and performance is horrible in multi-threaded use.
It makes sense that a spin-lock is a really bad choice when there is a lot of contention, because a thread waiting on a spin-lock busy-waits until the lock is released, tying up that core the whole time.
The ideal solution would probably be to create my own version of arc4random that maintains its own state array in instance variables and is not thread-safe. I would then refactor my app to create a separate instance of my random generator for each thread.
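For anyone who does want to go that route, here is a minimal sketch of the general idea: a generator whose entire state lives in a struct that each thread owns, so no locking is needed. Note that this is not arc4random; it swaps in Marsaglia's much simpler xorshift64 algorithm purely for illustration, and the names `local_rng`, `local_rng_seed`, `local_rng_next` and `local_rng_bit` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-thread generator: all state lives in this struct,
   so no locking is needed as long as each thread owns its own instance.
   Uses xorshift64 (NOT arc4random) to keep the sketch short. */
typedef struct {
    uint64_t state;   /* must never be zero for xorshift */
} local_rng;

void local_rng_seed(local_rng *rng, uint64_t seed) {
    /* Replace a zero seed with an arbitrary non-zero constant,
       because xorshift stays at zero forever once it reaches it. */
    rng->state = seed ? seed : 0x9E3779B97F4A7C15ULL;
}

uint64_t local_rng_next(local_rng *rng) {
    uint64_t x = rng->state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    rng->state = x;
    return x;
}

/* One random bit, taken from the top of the word rather than the bottom. */
int local_rng_bit(local_rng *rng) {
    return (int)(local_rng_next(rng) >> 63);
}
```

Each thread would declare its own `local_rng`, seed it differently from the others, and never share it.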
However, this is a side-project for my own research. That's more effort than I'm prepared to expend if I'm not getting paid.
As an experiment, I replaced the code with rand(), and the single-threaded case is quite a bit faster, since rand() is a simpler, faster algorithm, although the random numbers aren't as good. From what I've read, rand() has problems with cyclic patterns in the lower bits, so instead of the typical rand()%2, I used rand()%0x4000, to use the second-to-highest-order bit.
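A small sketch of the "use a higher bit" idea, for comparison with the low-bit version. One caveat: isolating a single bit takes a mask (`&`), whereas `% 0x4000` keeps the low 14 bits, so the mask form below is my assumption about the intent rather than the exact expression from my app. The function names are made up for the example.

```c
#include <stdlib.h>

/* Returns 0 or 1 using bit 14 of rand() instead of bit 0.
   Bit 14 is always in range because the C standard guarantees
   RAND_MAX is at least 32767 (0x7FFF). */
int high_bit_flip(void) {
    return (rand() & 0x4000) != 0;
}

/* For comparison: the conventional version, which uses only the lowest bit. */
int low_bit_flip(void) {
    return rand() % 2;
}
```

On platforms where RAND_MAX is 0x7FFF, bit 14 is the highest bit; where RAND_MAX is larger, it is simply a bit well away from the bottom.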
However, performance still decreased dramatically when I tried to use rand() in my multi-threaded code. It must use locking internally as well.
I then switched to rand_r(), which takes a pointer to a seed value, on the assumption that since it keeps no hidden state of its own, it probably does not use locking.
Bingo. I now get 415,674 points/second running on my 8-core Mac Pro.
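Roughly, the pattern looks like the sketch below: each thread gets its own seed and passes a pointer to it to rand_r(), so the threads share no generator state. This is a simplified illustration using pthreads, not my actual app code; `worker_args`, `worker`, `run_workers` and the fixed seed values are all invented for the example.

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the rand_r() declaration on strict compilers */
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread work description. */
typedef struct {
    unsigned int seed;   /* private rand_r() state for this thread */
    long iterations;     /* how many random bits to draw */
    long ones;           /* result: number of 1 bits seen */
} worker_args;

void *worker(void *p) {
    worker_args *w = p;
    unsigned int seed = w->seed;   /* local copy, touched only by this thread */
    long ones = 0;
    for (long i = 0; i < w->iterations; i++) {
        ones += (rand_r(&seed) & 0x4000) != 0;
    }
    w->ones = ones;
    return NULL;
}

/* Starts nthreads workers, waits for them, and returns the total
   number of 1 bits drawn, or -1 if allocation or thread creation fails. */
long run_workers(int nthreads, long iterations_each) {
    pthread_t *threads = malloc(sizeof *threads * (size_t)nthreads);
    worker_args *args = malloc(sizeof *args * (size_t)nthreads);
    if (!threads || !args) {
        free(threads);
        free(args);
        return -1;
    }
    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        args[i].seed = 12345u + (unsigned int)i;   /* distinct seed per thread */
        args[i].iterations = iterations_each;
        args[i].ones = 0;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) {
            break;
        }
        started++;
    }
    long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += args[i].ones;
    }
    free(threads);
    free(args);
    return started == nthreads ? total : -1;
}
```

Because the seeds are fixed, two runs with the same arguments give the same total; a real program would presumably seed each thread from the clock or another entropy source.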