Performance problem with Euler problem and recursion on Int64 types

[愿得一人] 2020-12-16 15:38

I'm currently learning Haskell using the Project Euler problems as my playground. I was astounded by how slow my Haskell programs turned out to be compared to similar program

6 answers
  •  无人及你
    2020-12-16 16:31

    There are a couple of interesting things in your question.

    First, you should be compiling with -O2. It simply does a better job (in this case, identifying and removing laziness that was still present in the -O version).
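
    As an illustration of the laziness point, here is a minimal sketch (not taken from the question) of making an accumulator strict by hand with the BangPatterns extension, so that the loop does not rely on the optimiser's strictness analysis. The name sumTo and the bound passed in main are hypothetical.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical example, not the code from the question:
-- sum the integers 1..n using a strict accumulator.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go :: Int -> Int -> Int
    go !acc i                      -- the bang forces acc on every iteration
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)
```

    With the bang in place the accumulator is evaluated at each step even without optimisation; at -O2 GHC would usually work this out on its own for a loop this simple.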

    Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.

    Make sure it is the same as the Java

    One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

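    import Data.Bits (shiftL)
    import Data.Int
    
    -- Literal transcription of the Java loop; n fixes the starting value of y.
    loop :: Int -> Int
    loop n = go 0 (n-1) 0 0
        where
            -- x and y are the loop variables, acc is the running count, and
            -- norm2 is updated incrementally on every step.
            go :: Int -> Int -> Int -> Int -> Int
            go x y acc norm2
                | x <= y        = case () of { _
                    | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                    | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                    | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
                }
                | otherwise     = acc
    
    main :: IO ()
    main = print $ loop (1 `shiftL` 30)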
    

    Peek at the core

    We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:

    main_$s$wgo
      :: Int#
         -> Int#
         -> Int#
         -> Int#
         -> Int#
    
    main_$s$wgo =
      \ (sc_sQa :: Int#)
        (sc1_sQb :: Int#)
        (sc2_sQc :: Int#)
        (sc3_sQd :: Int#) ->
        case <=# sc3_sQd sc2_sQc of _ {
          False -> sc1_sQb;
          True ->
            case <# sc_sQa 0 of _ {
              False ->
                case ># sc_sQa 2147483646 of _ {
                  False ->
                    main_$s$wgo
                      (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                      (+# sc1_sQb 1)
                      sc2_sQc
                      (+# sc3_sQd 1);
                  True ->
                    main_$s$wgo
                      (-#
                         (+# sc_sQa 2)
                         (*# 2 (+# sc3_sQd sc2_sQc)))
                      sc1_sQb
                      (-# sc2_sQc 1)
                      (-# sc3_sQd 1)
                };
              True ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  sc1_sQb
                  sc2_sQc
                  (+# sc3_sQd 1)
            }
        }
    

    that is, all unboxed into registers. That loop looks great!
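
    If you don't have ghc-core installed, one alternative (a sketch, assuming your GHC accepts these flags in an OPTIONS_GHC pragma) is to ask GHC itself to print the Core for a module while compiling it. The countUp function below is a hypothetical toy loop, used only to have something small to inspect.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl #-}
-- The pragma above makes GHC optimise this module at -O2 and print its
-- simplified Core during compilation.

-- Hypothetical toy loop, not the code from the question.
countUp :: Int -> Int
countUp n = go 0 0
  where
    go :: Int -> Int -> Int
    go acc i
      | i >= n    = acc
      | otherwise = go (acc + 1) (i + 1)

main :: IO ()
main = print (countUp 1000)
```

    In the dumped Core, look for a worker function whose arguments have type Int# rather than Int; that is the sign that the loop has been unboxed as in the listing above.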

    And it performs just fine (Linux/x86-64/GHC 7.0.3):

    ./A  5.95s user 0.01s system 99% cpu 5.980 total
    

    Checking the asm

    We get reasonable assembly too, as a nice loop:

    Main_mainzuzdszdwgo_info:
            cmpq    %rdi, %r8
            jg      .L8
    .L3:
            testq   %r14, %r14
            movq    %r14, %rdx
            js      .L4
            cmpq    $2147483646, %r14
            jle     .L9
    .L5:
            leaq    (%rdi,%r8), %r10
            addq    $2, %rdx
            leaq    -1(%rdi), %rdi
            addq    %r10, %r10
            movq    %rdx, %r14
            leaq    -1(%r8), %r8
            subq    %r10, %r14
            jmp     Main_mainzuzdszdwgo_info
    .L9:
            leaq    1(%r14,%r8,2), %r14
            addq    $1, %rsi
            leaq    1(%r8), %r8
            jmp     Main_mainzuzdszdwgo_info
    .L8:
            movq    %rsi, %rbx
            jmp     *0(%rbp)
    .L4:
            leaq    1(%r14,%r8,2), %r14
            leaq    1(%r8), %r8
            jmp     Main_mainzuzdszdwgo_info
    

    (This assembly was generated using the -fvia-C backend.)

    So this looks fine!


    My suspicion, as mentioned in the comment above, is that the version of libgmp you have on 32-bit Windows is generating poor code for 64-bit ints. First try upgrading to GHC 7.0.3, then try some of the other code generator backends; if you still have an issue with Int64, file a bug report on the GHC Trac.

    To broadly confirm that the cost really does come from making those C calls in the 32-bit emulation of 64-bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine. Indeed, the runtime goes from 3s to well over a minute.

    Lesson: use hardware 64 bits if at all possible.
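
    To check whether Int is a hardware 64-bit type on your platform, and to set up the Int64 versus Integer comparison yourself, something like the following sketch may help. It assumes a base library recent enough to provide finiteBitSize in Data.Bits. countTo is a hypothetical stand-in loop, not the code from the question, and no timings are claimed for it; run the program under time on your own machine to compare.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Bits (finiteBitSize)
import Data.Int (Int64)

-- Hypothetical stand-in loop, polymorphic in its numeric type so the same
-- code can be run at Int64 and at Integer.
countTo :: (Ord a, Num a) => a -> a
countTo n = go 0 0
  where
    go !acc !i
      | i >= n    = acc
      | otherwise = go (acc + 1) (i + 1)

main :: IO ()
main = do
  -- Reports 64 on a platform where Int is a native 64-bit word.
  putStrLn ("Int is " ++ show (finiteBitSize (0 :: Int)) ++ " bits wide here")
  print (countTo (1000000 :: Int64))    -- fixed-width arithmetic
  print (countTo (1000000 :: Integer))  -- arbitrary-precision arithmetic
```

    Keeping the loop polymorphic means only the type annotations in main change between the two runs, so any difference you measure comes from the numeric type rather than from the code.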
