I\'m currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar program
There's a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int
loop :: Int -> Int
loop n = go 0 (n-1) 0 0
where
go :: Int -> Int -> Int -> Int -> Int
go x y acc norm2
| x <= y = case () of { _
| norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
| norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
| otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
}
| otherwise = acc
main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
main_$s$wgo
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
sc2_sQc
(+# sc3_sQd 1);
True ->
main_$s$wgo
(-#
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
sc1_sQb
(-# sc2_sQc 1)
(-# sc3_sQd 1)
};
True ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
sc1_sQb
sc2_sQc
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.03):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
cmpq %rdi, %r8
jg .L8
.L3:
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
.L5:
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
.L9:
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
.L8:
movq %rsi, %rbx
jmp *0(%rbp)
.L4:
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.