Compare the benchmark numbers for a ConcurrentLinkedQueue and a BlockingQueue. You will see that the CAS-based queue is noticeably faster under moderate thread contention, which is the more realistic case in real-world applications.
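If you want to produce such numbers yourself, a rough timing harness could look like the sketch below. The class `QueueTiming`, the helper `time`, and the thread and operation counts are all made up for illustration, and `LinkedBlockingQueue` is just one BlockingQueue implementation picked for the comparison. A naive loop like this is sensitive to JIT warm-up and scheduling, so treat the output as a hint only; a proper harness such as JMH is the better tool for real measurements.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

final class QueueTiming {
    // Hypothetical helper: starts `threads` workers that each perform
    // `opsPerThread` offer/poll pairs, and returns the elapsed nanoseconds.
    static long time(Queue<Integer> queue, int threads, int opsPerThread) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    queue.offer(i);
                    queue.poll();
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while timing", e);
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        int threads = 4;      // assumed "moderate" contention
        int ops = 200_000;    // arbitrary workload size
        // One untimed pass per queue as a crude warm-up.
        time(new ConcurrentLinkedQueue<>(), threads, ops);
        time(new LinkedBlockingQueue<>(), threads, ops);
        System.out.println("ConcurrentLinkedQueue ns: "
                + time(new ConcurrentLinkedQueue<>(), threads, ops));
        System.out.println("LinkedBlockingQueue ns:   "
                + time(new LinkedBlockingQueue<>(), threads, ops));
    }
}
```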
The most attractive property of nonblocking algorithms is that if one thread stalls (a cache miss, or worse, a page fault or being descheduled by the OS), the other threads are not held up and can keep making progress. With a lock, by contrast, if the thread holding the lock is delayed in the same way, every other thread waiting for that lock to be freed is delayed along with it.
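To make the difference concrete, here is a minimal sketch of a lock-free counter built on a CAS retry loop. The class name `CasCounter` is made up for illustration. The point is that no thread ever holds anything another thread must wait for: a thread that stalls between the read and the `compareAndSet` just loses the race and retries later, while everyone else keeps going.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Lock-free increment: read the current value, compute the next one,
    // then try to publish it with a single compareAndSet.
    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            // Succeeds only if no other thread changed the value since the read.
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread won the race; loop and retry with the fresh value.
        }
    }

    int get() {
        return value.get();
    }
}
```

In real code `AtomicInteger.incrementAndGet()` already does this for you; the explicit loop is only there to show the pattern.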
To answer your questions: yes, nonblocking thread-safe algorithms and collections (ConcurrentLinkedQueue, ConcurrentSkipListMap/Set) can be significantly faster than their blocking counterparts. As Marcelo pointed out, though, getting nonblocking algorithms correct is very difficult and takes a lot of careful reasoning.
You should read about the Michael and Scott queue. It is the algorithm that ConcurrentLinkedQueue is based on, and it shows how a thread-safe enqueue or dequeue can take effect atomically at a single CAS.
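As a rough idea of what that looks like, here is a simplified sketch of a Michael and Scott style queue in Java. This is not the actual ConcurrentLinkedQueue source, which has further optimizations; the class name `MSQueue` and its method names are made up for illustration. It also leans on the garbage collector: nodes are never reused, so the ABA problem that the original paper handles with counted pointers does not come up here.

```java
import java.util.concurrent.atomic.AtomicReference;

final class MSQueue<T> {
    private static final class Node<T> {
        final T item;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    MSQueue() {
        // Head always points at a dummy node; the first real item is head.next.
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last != tail.get()) {
                continue; // tail moved while we were reading; start over
            }
            if (next != null) {
                // Tail is lagging behind: help the other thread by advancing it.
                tail.compareAndSet(last, next);
                continue;
            }
            // This CAS is where the enqueue takes effect.
            if (last.next.compareAndSet(null, node)) {
                // Best-effort swing of tail; if it fails, someone else helped.
                tail.compareAndSet(last, node);
                return;
            }
        }
    }

    // Returns the oldest item, or null if the queue is empty.
    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first != head.get()) {
                continue; // head moved while we were reading; start over
            }
            if (next == null) {
                return null; // only the dummy node is left
            }
            if (first == last) {
                // An enqueue linked its node but has not advanced tail yet.
                tail.compareAndSet(last, next);
                continue;
            }
            T item = next.item;
            // This CAS is where the dequeue takes effect; next becomes the new dummy.
            if (head.compareAndSet(first, next)) {
                return item;
            }
        }
    }
}
```

Note that enqueue issues two CAS operations, but only the first one (linking the node) decides whether the operation happened. The second merely tidies up the tail pointer, and any other thread can do that on its behalf, which is why a stalled enqueuer never blocks anyone.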