Performance problem with Euler problem and recursion on Int64 types

后端 未结 6 1497
[愿得一人]
[愿得一人] 2020-12-16 15:38

I\'m currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar program

相关标签:
6条回答
  • 2020-12-16 16:06

    I've played with the code a little and this version seems to run faster than Java version on my laptop (3.55s vs 4.63s):

    {-# LANGUAGE BangPatterns #-}
    
    arcLength :: Int->Int
    arcLength n = arcLength' 0 (n-1) 0 0 where
        arcLength' :: Int -> Int -> Int -> Int -> Int
        arcLength' !x !y !norm2 !acc
            | x > y = acc
            | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
            | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
            | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)      
    
    main = print $ arcLength (2^30)
    

    :

    $ ghc -O2 tmp1.hs -fforce-recomp
    [1 of 1] Compiling Main             ( tmp1.hs, tmp1.o )
    Linking tmp1 ...
    
    $ time ./tmp1
    843298604
    
    real    0m3.553s
    user    0m3.539s
    sys 0m0.006s
    
    0 讨论(0)
  • 2020-12-16 16:14

    dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:

    alias ggg="ghc --make -O2"
    alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 --fforce-recomp"
    
    0 讨论(0)
  • 2020-12-16 16:26

    Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):

    Main.$warcLength' =
      \ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
        (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
        case {__pkg_ccall ghc-prim hs_gtInt64 [...]
               ww_s1my ww1_s1mC GHC.Prim.realWorld#
    [...]
    

    So GHC has realized that it can unpack your integers, which is good. But this hs_getInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:

    pushl %eax
    movl 76(%esp),%eax
    pushl %eax
    call _hs_gtInt64
    addl $16,%esp
    

    So this looks very much like every operation on the Int64 get turned into a full-blown C call in the backend. Which is slow, obviously.

    The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.

    0 讨论(0)
  • 2020-12-16 16:28

    Hmm, this is interesting. So I just compiled both of your programs, and tried them out:

    % java -version                                                                                          
    java version "1.6.0_18"
    OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
    OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
    % javac ArcLength.java                                                                                   
    % java ArcLength                                                                                         
    843298604
    6630
    

    So about 6.6 seconds for the Java solution. Next is ghc with some optimization:

    % ghc --version                                                                                          
    The Glorious Glasgow Haskell Compilation System, version 6.12.1
    % ghc --make -O arc.hs
    % time ./arc                                                                                             
    843298604
    ./arc  12.68s user 0.04s system 99% cpu 12.718 total
    

    Just under 13 seconds for ghc -O

    Trying with some further optimization:

    % ghc --make -O3
    % time ./arc                                                                                             [13:16]
    843298604
    ./arc  5.75s user 0.00s system 99% cpu 5.754 total
    

    With further optimization flags, the haskell solution took under 6 seconds

    It would be interesting to know what version compiler you are using.

    0 讨论(0)
  • 2020-12-16 16:31

    There's a couple of interesting things in your question.

    You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).

    Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.

    Make sure it is the same as the Java

    One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

    import Data.Bits
    import Data.Int
    
    loop :: Int -> Int
    loop n = go 0 (n-1) 0 0
        where
            go :: Int -> Int -> Int -> Int -> Int
            go x y acc norm2
                | x <= y        = case () of { _
                    | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                    | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                    | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
                }
                | otherwise     = acc
    
    main = print $ loop (1 `shiftL` 30)
    

    Peek at the core

    We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:

    main_$s$wgo
      :: Int#
         -> Int#
         -> Int#
         -> Int#
         -> Int#
    
    main_$s$wgo =
      \ (sc_sQa :: Int#)
        (sc1_sQb :: Int#)
        (sc2_sQc :: Int#)
        (sc3_sQd :: Int#) ->
        case <=# sc3_sQd sc2_sQc of _ {
          False -> sc1_sQb;
          True ->
            case <# sc_sQa 0 of _ {
              False ->
                case ># sc_sQa 2147483646 of _ {
                  False ->
                    main_$s$wgo
                      (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                      (+# sc1_sQb 1)
                      sc2_sQc
                          (+# sc3_sQd 1);
                  True ->
                    main_$s$wgo
                      (-#
                         (+# sc_sQa 2)
                         (*# 2 (+# sc3_sQd sc2_sQc)))
                      sc1_sQb
                      (-# sc2_sQc 1)
                      (-# sc3_sQd 1)
                };
              True ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  sc1_sQb
                  sc2_sQc
                  (+# sc3_sQd 1)
    

    that is, all unboxed into registers. That loop looks great!

    And performs just fine (Linux/x86-64/GHC 7.03):

    ./A  5.95s user 0.01s system 99% cpu 5.980 total
    

    Checking the asm

    We get reasonable assembly too, as a nice loop:

    Main_mainzuzdszdwgo_info:
            cmpq    %rdi, %r8
            jg      .L8
    .L3:
            testq   %r14, %r14
            movq    %r14, %rdx
            js      .L4
            cmpq    $2147483646, %r14
            jle     .L9
    .L5:
            leaq    (%rdi,%r8), %r10
            addq    $2, %rdx
            leaq    -1(%rdi), %rdi
            addq    %r10, %r10
            movq    %rdx, %r14
            leaq    -1(%r8), %r8
            subq    %r10, %r14
            jmp     Main_mainzuzdszdwgo_info
    .L9:
            leaq    1(%r14,%r8,2), %r14
            addq    $1, %rsi
            leaq    1(%r8), %r8
            jmp     Main_mainzuzdszdwgo_info
    .L8:
            movq    %rsi, %rbx
            jmp     *0(%rbp)
    .L4:
            leaq    1(%r14,%r8,2), %r14
            leaq    1(%r8), %r8
            jmp     Main_mainzuzdszdwgo_info
    

    Using the -fvia-C backend.

    So this looks fine!


    My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.

    Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.

    Lesson: use hardware 64 bits if at all possible.

    0 讨论(0)
  • 2020-12-16 16:31

    The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.

    With -O2 I get performance competitive with Java:

    tommd@Mavlo:Test$ uname -r -m
    2.6.37 x86_64
    tommd@Mavlo:Test$ ghc --version
    The Glorious Glasgow Haskell Compilation System, version 7.0.3
    
    tommd@Mavlo:Test$ ghc -O2 so.hs
    [1 of 1] Compiling Main             ( so.hs, so.o )
    Linking so ...
    tommd@Mavlo:Test$ time ./so
    843298604
    
    real    0m4.948s
    user    0m4.896s
    sys     0m0.000s
    

    And Java is about 1 second faster (20%):

    tommd@Mavlo:Test$ time java ArcLength
    843298604
    3880
    
    real    0m3.961s
    user    0m3.936s
    sys     0m0.024s
    

    But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:

    tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
    [1 of 1] Compiling Main             ( so.hs, so.o )
    Linking so ...
    tommd@Mavlo:Test$ time ./so
    843298604
    
    real    0m5.973s
    user    0m5.968s
    sys     0m0.000s
    

    But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:

    $ ghc -O2 -fllvm -fforce-recomp so.hs
    [1 of 1] Compiling Main             ( so.hs, so.o )
    Linking so ...
    tommd@Mavlo:Test$ time ./so
    843298604
    
    real    0m4.099s
    user    0m4.088s
    sys     0m0.000s
    

    There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:

    tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
    [1 of 1] Compiling Main             ( so.hs, so.o )
    
    on the commandline:
        Warning: The -fvia-c flag will be removed in a future GHC release
    Linking so ...
    ttommd@Mavlo:Test$ ti
    tommd@Mavlo:Test$ time ./so
    843298604
    
    real    0m3.982s
    user    0m3.972s
    sys     0m0.000s
    

    Hopefully the NCG will be improved to match via-c for this case before they remove this backend.

    0 讨论(0)
提交回复
热议问题