The documentation of the seq function says the following:
A note on evaluation order: the expression
seq a b does not guarantee that a will be evaluated before b.
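The quoted note is about evaluation order only. As values, seq a b and pseq a b are the same: b if a is defined, bottom otherwise. What differs is what GHC may do about ordering, and a small program cannot reliably observe that. A minimal sketch, assuming GHC, where I take pseq from GHC.Conc in base (as far as I know, Control.Parallel re-exports this same function). The names viaSeq, viaPseq and throwsWith are illustrative:

```haskell
import Control.Exception (ErrorCall, evaluate, try)
import GHC.Conc (pseq)

-- Both combinators return their second argument when the first is defined.
viaSeq, viaPseq :: Int
viaSeq  = (2 + 3 :: Int) `seq` 42
viaPseq = (2 + 3 :: Int) `pseq` 42

-- Both are bottom when the first argument is bottom; here that shows up
-- as an ErrorCall exception once the result is forced.
throwsWith :: (Int -> Int -> Int) -> IO Bool
throwsWith combine = do
  r <- try (evaluate (combine (error "a is bottom") 42)) :: IO (Either ErrorCall Int)
  pure (either (const True) (const False) r)

main :: IO ()
main = do
  print (viaSeq, viaPseq)       -- prints (42,42)
  seqThrows  <- throwsWith seq
  pseqThrows <- throwsWith pseq
  print (seqThrows, pseqThrows) -- prints (True,True)
```

So the only thing pseq adds over seq is the ordering promise. It does not change what the expression computes.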
Edit: My theory has been foiled. The timings I observed were in fact heavily skewed by the profiling itself, and with profiling off, the data contradicts the theory. Moreover, the timings vary quite a bit between GHC versions. I am collecting better observations now, and I will edit this answer further once I arrive at a conclusive point.
Concerning the question of why pseq is slower, I have a theory.
Let us re-phrase acc' `seq` go acc' xs as strict (go (strict acc') xs), and acc' `pseq` go acc' xs as lazy (go (strict acc') xs).
Then go acc (x:xs) = let ... in ... can be rewritten as go acc (x:xs) = strict (go (x + acc) xs) in the case of seq,
and as go acc (x:xs) = lazy (go (x + acc) xs) in the case of pseq.
Now it is easy to see that, in the case of pseq, the recursive call to go is wrapped in a lazy thunk that will only be evaluated at some later point. In the definition of sum, go never appears to the left of pseq, so during the run of sum this evaluation is never forced. Moreover, this happens for every recursive call of go, so thunks accumulate.
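The lazy in this re-phrasing is more than notation. As far as I know, GHC defines pseq in GHC.Conc.Sync as x `seq` lazy y. Here lazy :: a -> a from GHC.Exts is an identity function, but the strictness analyser treats it as lazy in its argument, so it cannot conclude that y is demanded and hoist its evaluation ahead of x. Below is a sketch with a hypothetical local copy, myPseq, dropped into the same go loop. It computes the same sum as the seq version, so any difference lies only in the code GHC generates:

```haskell
import GHC.Exts (lazy)

-- Hypothetical local copy of what I understand GHC's pseq to be:
-- force x to WHNF, then return y wrapped in 'lazy', which stops the
-- strictness analyser from treating y as demanded.
myPseq :: a -> b -> b
myPseq x y = x `seq` lazy y

-- The SumPseq.hs loop, with myPseq standing in for pseq.
sumMyPseq :: Num a => [a] -> a
sumMyPseq = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `myPseq` go acc' xs

main :: IO ()
main = print (sumMyPseq [1 .. 10 ^ 7 :: Integer])
```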
This is a theory built from thin air, but I do have partial evidence for it: go allocates memory linearly in the pseq case, but not in the seq case. You can see this for yourself by running the following shell commands:
```shell
for file in SumNaive.hs SumPseq.hs SumSeq.hs
do
    stack ghc \
        --library-profiling \
        --package parallel \
        -- \
        $file \
        -main-is ${file%.hs} \
        -o ${file%.hs} \
        -prof \
        -fprof-auto
done

for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
    time ./${file%.hs} +RTS -P
done
```
Then compare the memory allocation of the go cost centre in the resulting .prof files:
```
COST CENTRE            ...  ticks      bytes
SumNaive.prof:sum.go   ...    782  559999984
SumPseq.prof:sum.go    ...    669  800000016
SumSeq.prof:sum.go     ...    161          0
```
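To see what "linear" means in these figures, divide the go cost centre's allocation by the 10^7 list elements. This assumes one call of go per element, as the definitions suggest, and the numbers are profiled figures, with the caveat from the edit note at the top. The result is a roughly constant cost per element:

```latex
\begin{aligned}
\text{pseq:}\quad  & \frac{800\,000\,016\ \text{bytes}}{10^7\ \text{elements}} \approx 80\ \text{bytes per element} \\
\text{naive:}\quad & \frac{559\,999\,984\ \text{bytes}}{10^7\ \text{elements}} \approx 56\ \text{bytes per element} \\
\text{seq:}\quad   & \frac{0\ \text{bytes}}{10^7\ \text{elements}} = 0
\end{aligned}
```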
Postscriptum
Since there appears to be disagreement about which optimizations actually have what effect, I am posting my exact source code and timings, so that there is a common baseline.
SumNaive.hs
```haskell
module SumNaive where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = go (x + acc) xs

main = print $ sum [1..10^7]
```
SumSeq.hs
```haskell
module SumSeq where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs

main = print $ sum [1..10^7]
```
SumPseq.hs
```haskell
module SumPseq where

import Prelude hiding (sum)
import Control.Parallel (pseq)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `pseq` go acc' xs

main = print $ sum [1..10^7]
```
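For anyone who wants a fourth data point, here is a sketch of the strict accumulator from the seq variant written two other common ways: with a bang pattern, and with foldl' from Data.List. I have not timed these. sumBang and sumFoldl' are hypothetical names, and this is not one of the files measured below:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Strict accumulator via a bang pattern: acc is forced to WHNF
-- every time go is entered, much like the seq variant.
sumBang :: Num a => [a] -> a
sumBang = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (x + acc) xs

-- Library equivalent: foldl' forces the accumulator at each step.
sumFoldl' :: Num a => [a] -> a
sumFoldl' = foldl' (+) 0

main :: IO ()
main = print (sumBang [1 .. 10 ^ 7 :: Integer], sumFoldl' [1 .. 10 ^ 7 :: Integer])
```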
Time without optimizations:
```
./SumNaive +RTS -P  4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P    0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P   2.19s user 0.22s system 99% cpu 2.408 total
```
Time with -O:
```
./SumNaive +RTS -P  0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P    0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P   1.91s user 0.24s system 99% cpu 2.147 total
```
Time with -O2:
```
./SumNaive +RTS -P  0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P    0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P   1.92s user 0.22s system 99% cpu 2.137 total
```
It can be seen that:
The naive variant performs poorly without optimizations, but excellently with either -O or -O2, to the point of outperforming all the others.
The seq variant performs well, and optimizations improve it only modestly (from 0.84 s to about 0.6 s), so with either -O or -O2 the naive variant outperforms it.
The pseq variant performs poorly throughout. Without optimizations it is about twice as fast as the naive variant (2.41 s vs 5.25 s), but with either -O or -O2 it is roughly 3.5 times slower than the others. Optimization shaves about as much time off it as off the seq variant (roughly a quarter of a second).
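For reference, here are the ratios and differences behind these statements, computed from the "total" (wall-clock) column of the timings above:

```latex
\begin{aligned}
\text{no optimization:}\quad & \frac{t_{\text{naive}}}{t_{\text{pseq}}} = \frac{5.254}{2.408} \approx 2.2 \\
\text{-O:}\quad  & \frac{t_{\text{pseq}}}{t_{\text{naive}}} = \frac{2.147}{0.584} \approx 3.7, \qquad \frac{t_{\text{pseq}}}{t_{\text{seq}}} = \frac{2.147}{0.605} \approx 3.5 \\
\text{-O2:}\quad & \frac{t_{\text{pseq}}}{t_{\text{naive}}} = \frac{2.137}{0.570} \approx 3.7, \qquad \frac{t_{\text{pseq}}}{t_{\text{seq}}} = \frac{2.137}{0.621} \approx 3.4 \\
\text{gain from -O:}\quad & 0.843 - 0.605 = 0.238\ \text{s (seq)}, \qquad 2.408 - 2.147 = 0.261\ \text{s (pseq)}
\end{aligned}
```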