http://pastebin.com/YMS4ehRj
^ This is my implementation of parallel merge sort. Basically, for every split, the first half is handled by a thread where
Your parallelism is too fine-grained: there are too many threads, each doing only a small amount of work. Define a threshold so that arrays smaller than the threshold are sorted sequentially. Also be careful about the number of threads you spawn; a good rule of thumb is that the number of threads should not be much larger than the number of cores.
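To make the threshold idea concrete, here is a minimal sketch in C++ (I can't tell which language your pastebin uses, so treat this as an illustration, not a drop-in fix). The names `kSequentialThreshold`, `sort_range`, and `parallel_merge_sort` are made up for this example, and the cutoff value is a guess that you would need to tune by measurement. The recursion depth is capped at about log2(cores), so the number of concurrently running threads stays close to the number of cores.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical cutoff: ranges of this size or smaller are sorted sequentially.
constexpr std::ptrdiff_t kSequentialThreshold = 1 << 13;

// Sorts data[lo, hi). `depth` is how many more recursion levels may spawn a thread.
void sort_range(std::vector<int>& data, std::ptrdiff_t lo, std::ptrdiff_t hi, unsigned depth) {
    if (hi - lo < 2) return;
    if (hi - lo <= kSequentialThreshold || depth == 0) {
        // Small range, or thread budget used up: no new threads.
        std::sort(data.begin() + lo, data.begin() + hi);
        return;
    }
    const std::ptrdiff_t mid = lo + (hi - lo) / 2;
    // Left half runs on a new thread, right half on the current thread.
    auto left = std::async(std::launch::async, [&data, lo, mid, depth] {
        sort_range(data, lo, mid, depth - 1);
    });
    sort_range(data, mid, hi, depth - 1);
    left.get();  // wait for the left half before merging
    std::inplace_merge(data.begin() + lo, data.begin() + mid, data.begin() + hi);
}

void parallel_merge_sort(std::vector<int>& data) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;  // hardware_concurrency() may return 0 when unknown
    unsigned depth = 0;
    while ((1u << depth) < cores) ++depth;  // roughly log2(cores) levels
    sort_range(data, 0, static_cast<std::ptrdiff_t>(data.size()), depth);
}
```

The two halves touch disjoint parts of the vector, so no locking is needed. The merge here is still a plain sequential `std::inplace_merge`.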
Because much of your computation is in the merge function, another suggestion is to use a divide-and-conquer merge instead of a simple merge. The advantage is twofold: the running time is smaller, and it is easy to spawn threads to run the merge in parallel. You can get an idea of how to implement parallel merge here: http://drdobbs.com/high-performance-computing/229204454. They also have an article about parallel merge sort that might be helpful for you: http://drdobbs.com/high-performance-computing/229400239
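As a rough sketch of what a divide-and-conquer merge can look like in C++ (my own simplified version, not the code from those articles): take the middle element of the larger input, binary-search its position in the smaller input, write it to its final slot in the output, and then merge the two left parts and the two right parts independently. The names `kMergeThreshold`, `merge_range`, and `merge_sorted` are made up for this example, and the cutoff and default depth are untuned guesses.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using It = std::vector<int>::const_iterator;
using OutIt = std::vector<int>::iterator;

// Hypothetical cutoff: inputs this small are merged sequentially.
constexpr std::ptrdiff_t kMergeThreshold = 1 << 12;

// Merges sorted ranges [a_first, a_last) and [b_first, b_last) into `out`.
// `depth` is how many more recursion levels may spawn a thread.
void merge_range(It a_first, It a_last, It b_first, It b_last, OutIt out, unsigned depth) {
    // Keep A as the larger range so the split is reasonably balanced.
    if (a_last - a_first < b_last - b_first) {
        std::swap(a_first, b_first);
        std::swap(a_last, b_last);
    }
    const std::ptrdiff_t total = (a_last - a_first) + (b_last - b_first);
    if (total <= kMergeThreshold || depth == 0) {
        std::merge(a_first, a_last, b_first, b_last, out);
        return;
    }
    const It a_mid = a_first + (a_last - a_first) / 2;
    // Everything in B before b_mid is strictly less than *a_mid.
    const It b_mid = std::lower_bound(b_first, b_last, *a_mid);
    const OutIt out_mid = out + (a_mid - a_first) + (b_mid - b_first);
    *out_mid = *a_mid;  // the pivot lands in its final position
    // The two sub-merges write to disjoint parts of the output.
    auto left = std::async(std::launch::async, [=] {
        merge_range(a_first, a_mid, b_first, b_mid, out, depth - 1);
    });
    merge_range(a_mid + 1, a_last, b_mid, b_last, out_mid + 1, depth - 1);
    left.get();
}

// Convenience wrapper; depth 3 allows up to 8 sub-merges to run at once.
std::vector<int> merge_sorted(const std::vector<int>& a, const std::vector<int>& b,
                              unsigned depth = 3) {
    std::vector<int> out(a.size() + b.size());
    merge_range(a.begin(), a.end(), b.begin(), b.end(), out.begin(), depth);
    return out;
}
```

Note that this version needs a separate output buffer, and swapping the inputs means it is not a stable merge. That does not matter for plain integers but would for records with keys.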