In my multithreaded application I see heavy lock contention, which prevents good scalability across multiple cores. I have decided to use lock free programming to solv
Most lock-free algorithms and data structures are built on some atomic operation, i.e. a change to a memory location that, once begun by one thread, is completed before any other thread can perform that same operation on it. Do you have such an operation in your environment?
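To make that concrete, here is a minimal sketch of building on one such atomic operation, assuming your environment is C++11 or later, where `std::atomic` provides compare-and-swap. The names `fetch_add_cas` and `run_counter` are hypothetical and only there for illustration. In real code you would probably just call `fetch_add` on the atomic, but the retry loop shows the pattern that most lock-free structures follow.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: atomically add `delta` to `value` with a
// compare-and-swap retry loop. Returns the value seen before the add.
int fetch_add_cas(std::atomic<int>& value, int delta) {
    int expected = value.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so each retry works from fresh data.
    while (!value.compare_exchange_weak(expected, expected + delta,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // Another thread changed `value` first (or the weak CAS failed
        // spuriously); loop and try again.
    }
    return expected;
}

// Hypothetical driver: several threads bump one shared counter without a mutex.
int run_counter(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                fetch_add_cas(counter, 1);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

No thread ever blocks while holding a lock here: a thread that loses the race simply retries with the updated value. Whether this actually scales better than your current locks depends on how contended that single memory location is, so measure before committing to it.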
See here for the canonical paper on this subject.
Also try this Wikipedia article for further ideas and links.