I was working on a kernel that performed many global memory accesses per thread, so I copied the data to local memory, which gave a speedup of 40%.
I still wanted more speedup.
(I know this is an old question, but the answers given aren't very accurate, and I saw conflicting answers elsewhere during Google searches.)
According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):
Private memory is memory that is unique to an individual work-item. Local variables and nonpointer kernel arguments are private by default. In practice, these variables are usually mapped to registers, although private arrays and any spilled registers are usually mapped to an off-chip (i.e., long-latency) memory.
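So yes: if you use a great deal of private memory (enough that registers spill), or use arrays in private memory, it can be slower than local memory, because that data may end up in off-chip memory rather than in registers.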
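To make the distinction concrete, here is a rough sketch of my own (not from the book) with two kernels that compute the same per-work-item sum. The names `sum_scalars`, `sum_private_array`, `TILE`, and `shift` are hypothetical, and the `#ifndef __OPENCL_VERSION__` shims exist only so the file can also be compiled and checked as plain C. Whether a given compiler really keeps the scalars in registers, or really places the array off-chip, depends on the implementation and device, so treat the comments as likely outcomes rather than guarantees.

```c
/* Host-only shims: assumptions that let this OpenCL C sketch also build as
   plain C. A real OpenCL compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t host_gid;  /* hypothetical stand-in for the work-item's global ID */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
#endif

#define TILE 8  /* hypothetical number of elements each work-item handles */

/* Private scalars only: gid, base, acc and i are private variables that a
   compiler will usually map to registers. */
__kernel void sum_scalars(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    size_t base = gid * TILE;
    float acc = 0.0f;
    for (int i = 0; i < TILE; ++i)
        acc += in[base + i];
    out[gid] = acc;
}

/* Private array with a runtime-dependent index: tile[] is still private
   memory, but an array indexed this way often cannot live in registers and
   may be placed in off-chip memory instead. Assumes shift >= 0. */
__kernel void sum_private_array(__global const float *in, __global float *out,
                                int shift)
{
    size_t gid = get_global_id(0);
    size_t base = gid * TILE;
    float tile[TILE];  /* private array */
    for (int i = 0; i < TILE; ++i)
        tile[i] = in[base + i];
    float acc = 0.0f;
    for (int i = 0; i < TILE; ++i)
        acc += tile[(i + shift) % TILE];
    out[gid] = acc;
}
```

Both kernels read the same `TILE` elements per work-item; the only difference is whether the intermediate data sits in private scalars or in a private array. If the second form turns out slower on your device, the array placement described in the quote is a plausible cause, and profiling or inspecting the compiler's output is the way to confirm it.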