I am new to Haskell and I am trying to implement a few known algorithms in it.
I have implemented merge sort on strings. I am a bit disappointed with the performance of my Haskell implementation compared to C and Java implementations. On my machine (Ubuntu Linux, 1.8 GHz), C (gcc 4.3.3) sorts 1 000 000 strings in 1.85 s, Java (Java SE 1.6.0_14) in 3.68 s, Haskell (GHC 6.8.2) in 25.89 s. With larger input (10 000 000 strings), C takes 21.81 s, Java takes 59.68 s, Haskell starts swapping and I preferred to stop the program after several minutes.
Since I am new to Haskell, I would be interested to know if my implementation can be made more time / space efficient.
Thank you in advance for any hint Giorgio
My implementation:
merge :: [String] -> [String] -> [String]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys) = if x < y
then x : (merge xs (y:ys))
else y : (merge (x:xs) ys)
mergeSort :: [String] -> [String]
mergeSort xs = if (l < 2)
then xs
else merge h t
where l = length xs
n = l `div` 2
s = splitAt n xs
h = mergeSort (fst s)
t = mergeSort (snd s)
Try this version:
mergesort :: [String] -> [String]
mergesort = mergesort' . map wrap
mergesort' :: [[String]] -> [String]
mergesort' [] = []
mergesort' [xs] = xs
mergesort' xss = mergesort' (merge_pairs xss)
merge_pairs :: [[String]] -> [[String]]
merge_pairs [] = []
merge_pairs [xs] = [xs]
merge_pairs (xs:ys:xss) = merge xs ys : merge_pairs xss
merge :: [String] -> [String] -> [String]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
= if x > y
then y : merge (x:xs) ys
else x : merge xs (y:ys)
wrap :: String -> [String]
wrap x = [x]
- Bad idea is splitting list first. Instead of it just make list of one member lists. Haskell is lazy, it will be done in right time.
- Then merge pairs of lists until you have only one list.
Edit: Someone who down-vote this answer: above merge sort implementation is same algorithm as used in ghc Data.List.sort except with cmp function removed. Well ghc authors are may be wrong :-/
In Haskell, a string is a lazy list of characters and has the same overhead as any other list. If I remember right from a talk I heard Simon Peyton Jones give in 2004, the space cost in GHC is 40 bytes per character. For an apples-to-apples comparation you probably should be sorting Data.ByteString
, which is designed to give performance comparable to other languages.
Better way to split the list to avoid the issue CesarB points out:
split [] = ([], [])
split [x] = ([x], [])
split (x : y : rest) = (x : xs, y : ys)
where (xs, ys) = split rest
mergeSort [] = []
mergeSort [x] = [x]
mergeSort xs = merge (mergesort ys) (mergesort zs)
where (ys, zs) = split xs
EDIT: Fixed.
I am not sure if this is the cause of your problem, but remember that lists are a sequential data structure. In particular, both length xs
and splitAt n xs
will take an amount of time proportional to the length of the list (O(n)
).
In C and Java, you are most probably using arrays, which take constant time for both operations (O(1)
).
Edit: answering your question on how to make it more efficient, you can use arrays in Haskell too.
来源:https://stackoverflow.com/questions/1215432/merge-sort-in-haskell