Performance problem with Euler problem and recursion on Int64 types

后端未结

关注

 6  1510

I\'m currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar program

相关标签:

6条回答

轮回少年

2020-12-16 16:06

I've played with the code a little and this version seems to run faster than Java version on my laptop (3.55s vs 4.63s):

{-# LANGUAGE BangPatterns #-}

arcLength :: Int->Int
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' :: Int -> Int -> Int -> Int -> Int
    arcLength' !x !y !norm2 !acc
        | x > y = acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
        | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
        | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)      

main = print $ arcLength (2^30)

$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main             ( tmp1.hs, tmp1.o )
Linking tmp1 ...

$ time ./tmp1
843298604

real    0m3.553s
user    0m3.539s
sys 0m0.006s

0 讨论(0)

鱼传尺愫

2020-12-16 16:14
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
```
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 --fforce-recomp"
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
天涯浪人

2020-12-16 16:26
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
```
Main.$warcLength' =
  \ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
    (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
    case {__pkg_ccall ghc-prim hs_gtInt64 [...]
           ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
```
So GHC has realized that it can unpack your integers, which is good. But this hs_getInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
```
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
```
So this looks very much like every operation on the Int64 get turned into a full-blown C call in the backend. Which is slow, obviously.

The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
0 讨论(0)
发布评论:

提交评论
- 加载中...

梦谈多话

2020-12-16 16:28

Hmm, this is interesting. So I just compiled both of your programs, and tried them out:

% java -version                                                                                          
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java                                                                                   
% java ArcLength                                                                                         
843298604
6630

So about 6.6 seconds for the Java solution. Next is ghc with some optimization:

% ghc --version                                                                                          
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc                                                                                             
843298604
./arc  12.68s user 0.04s system 99% cpu 12.718 total

Just under 13 seconds for ghc -O

Trying with some further optimization:

% ghc --make -O3
% time ./arc                                                                                             [13:16]
843298604
./arc  5.75s user 0.00s system 99% cpu 5.754 total

With further optimization flags, the haskell solution took under 6 seconds

It would be interesting to know what version compiler you are using.

0 讨论(0)

无人及你

2020-12-16 16:31

There's a couple of interesting things in your question.

You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).

Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.

Make sure it is the same as the Java

One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y        = case () of { _
                | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
            }
            | otherwise     = acc

main = print $ loop (1 `shiftL` 30)

Peek at the core

We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
  \ (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                      (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

that is, all unboxed into registers. That loop looks great!

And performs just fine (Linux/x86-64/GHC 7.03):

./A  5.95s user 0.01s system 99% cpu 5.980 total

Checking the asm

We get reasonable assembly too, as a nice loop:

Main_mainzuzdszdwgo_info:
        cmpq    %rdi, %r8
        jg      .L8
.L3:
        testq   %r14, %r14
        movq    %r14, %rdx
        js      .L4
        cmpq    $2147483646, %r14
        jle     .L9
.L5:
        leaq    (%rdi,%r8), %r10
        addq    $2, %rdx
        leaq    -1(%rdi), %rdi
        addq    %r10, %r10
        movq    %rdx, %r14
        leaq    -1(%r8), %r8
        subq    %r10, %r14
        jmp     Main_mainzuzdszdwgo_info
.L9:
        leaq    1(%r14,%r8,2), %r14
        addq    $1, %rsi
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info
.L8:
        movq    %rsi, %rbx
        jmp     *0(%rbp)
.L4:
        leaq    1(%r14,%r8,2), %r14
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info

Using the -fvia-C backend.

So this looks fine!

My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.

Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.

Lesson: use hardware 64 bits if at all possible.

0 讨论(0)

猫巷女王i

2020-12-16 16:31

The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.

With -O2 I get performance competitive with Java:

tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3

tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real    0m4.948s
user    0m4.896s
sys     0m0.000s

And Java is about 1 second faster (20%):

tommd@Mavlo:Test$ time java ArcLength
843298604
3880

real    0m3.961s
user    0m3.936s
sys     0m0.024s

But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:

tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real    0m5.973s
user    0m5.968s
sys     0m0.000s

But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:

$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real    0m4.099s
user    0m4.088s
sys     0m0.000s

There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:

tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main             ( so.hs, so.o )

on the commandline:
    Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
ttommd@Mavlo:Test$ ti
tommd@Mavlo:Test$ time ./so
843298604

real    0m3.982s
user    0m3.972s
sys     0m0.000s

Hopefully the NCG will be improved to match via-c for this case before they remove this backend.

0 讨论(0)