Raymond Chen has been doing a huge series on lock-free algorithms. Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern
Under Windows on x64, a straightforward lock-free freelist (with no combining array in front of it) is about an order of magnitude faster than a mutex-based freelist.
On my laptop (Core i5), with a single thread, I get about 31 million freelist operations per second lock-free, versus about 2.3 million operations per second with a mutex.
With two threads (on separate physical cores), I get about 12.4 million freelist operations per second per thread lock-free. With a mutex, I get about 80 thousand operations per second.
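
For readers who have not seen one, here is a minimal sketch of what a "straightforward" lock-free freelist can look like. It is not the code that was benchmarked above. It is an illustration written in portable C++ with std::atomic rather than the Windows Interlocked/SList functions, and the class name LockFreeFreelist, the index-based layout, and the 32-bit version tag are all assumptions made for the example. The freelist hands out slot indices into a fixed pool, and the version tag packed next to the head index is there so that a compare-exchange fails if the head was popped and pushed back in between (the ABA problem).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical index-based lock-free freelist (a Treiber-style stack).
// head_ packs {32-bit version tag, 32-bit slot index} into one 64-bit word.
// The tag changes on every successful update, so a stale head value cannot
// pass the compare-exchange unless the tag wraps around exactly (2^32 updates).
class LockFreeFreelist {
public:
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "empty list" marker

    explicit LockFreeFreelist(std::uint32_t capacity)
        : next_(capacity), head_(pack(kNil, 0)) {
        // Every slot starts out free.
        for (std::uint32_t i = 0; i < capacity; ++i) release(i);
    }

    // Pops a free slot index, or returns kNil if nothing is free.
    std::uint32_t acquire() {
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        for (;;) {
            const std::uint32_t idx = index_of(old_head);
            if (idx == kNil) return kNil;
            // This read may be stale if another thread recycled idx meanwhile;
            // in that case the tag has changed and the CAS below fails.
            const std::uint32_t next = next_[idx].load(std::memory_order_relaxed);
            const std::uint64_t new_head = pack(next, tag_of(old_head) + 1);
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
                return idx;
            }
            // On failure old_head was reloaded by compare_exchange_weak; retry.
        }
    }

    // Pushes a slot index back onto the freelist.
    void release(std::uint32_t idx) {
        assert(idx < next_.size());
        std::uint64_t old_head = head_.load(std::memory_order_relaxed);
        for (;;) {
            next_[idx].store(index_of(old_head), std::memory_order_relaxed);
            const std::uint64_t new_head = pack(idx, tag_of(old_head) + 1);
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
                return;
            }
        }
    }

private:
    static std::uint64_t pack(std::uint32_t idx, std::uint32_t tag) {
        return (static_cast<std::uint64_t>(tag) << 32) | idx;
    }
    static std::uint32_t index_of(std::uint64_t h) {
        return static_cast<std::uint32_t>(h);
    }
    static std::uint32_t tag_of(std::uint64_t h) {
        return static_cast<std::uint32_t>(h >> 32);
    }

    std::vector<std::atomic<std::uint32_t>> next_;  // next_[i]: slot below i
    std::atomic<std::uint64_t> head_;               // packed {tag, index}
};
```

In use, the indices would refer to entries in a caller-owned array of objects: acquire() takes an entry off the list and release(idx) puts it back. The mutex-based version it is being compared against would do the same two operations on a plain linked list while holding a lock.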