I wrote some time ago a quicksort-based algorithm that you can find there (actually it is a selection algorithm, but can be used a sort algorithm too):
- http://processors.wiki.ti.com/index.php/Top_K_Elements_Sort_For_DSP_With_Index_Ordering
The lessons I learned from this experience are the following:
- Carefully tune the partition loop of your algorithm. This is often underestimated, but you do get a significant performance boost if you take care of writing loops that the compiler/CPU will be able to software pipeline. This alone has lead to a win of about 50% in CPU cyles.
- Hand-coding small sorts gives you a major peformance win. When the number of elements to be sorted in a partition is under 8 elements, just don't bother trying to recurse, but instead implement a hard-coded sort using just ifs and swaps (have a look at the fast_small_sort function in this code). This can lead to a win of about 50% in CPU cycles giving the quicksort the same practical performance as a well written "merge sort".
- Spend time to pick a better pivot value when a "poor" pivot selection is detected. My implementation starts using a "median of median" algorithm for pivot selection whenever a pivot selection leads to one side being under 16% of the remaining elements to be sorted. This is a mitigation strategy for worst-case performance of quick-sort, and help ensure that in practice the upper bound is also O(n*log(n)) instead of O(n^2).
- Optimize for arrays with lots of equal values (when needed). If the arrays to be sorted have lots of equal values it is worth optimizing for as it will lead to poor pivot selection. In my code I do it by counting all the array entries that are equal to the pivot value. This enables me treat the pivot and all the equal values in the array in a faster way, and doesn't degrade performance when it is not applicable. This is another mitigation strategy for worst-case performance, it helps reduce the worst-case stack usage by reducing the max recursion level in a drastic way.
I hope this helps, Laurent.