Performance problem with Euler problem and recursion on Int64 types

[愿得一人] 2020-12-16 15:38

I'm currently learning Haskell using the Project Euler problems as my playground. I was astounded by how slow my Haskell programs turned out to be compared to similar program

6 answers

无人及你 (original poster)

2020-12-16 16:31

There's a couple of interesting things in your question.

First, you should be compiling with -O2. It does a better job here: it identifies and removes laziness that was still present in the -O build.

Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). Like others who tried it, I find that running your code on my Linux box takes around 6s. That seems fine.
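On the laziness point: one way to make a loop like this less dependent on what the optimiser discovers is to mark the accumulator strict yourself. The following is only a minimal sketch, not code from the question. The function sumTo is a hypothetical example, and it assumes GHC's BangPatterns extension.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical example, not from the question: sum the integers 1..n
-- with an explicitly strict accumulator.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    -- The bang on acc forces it on every iteration, so no chain of
    -- unevaluated (acc + i) thunks can build up at any -O level.
    go :: Int -> Int -> Int
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 100)
```

With -O2, GHC's strictness analysis will usually find this by itself, so the bang mainly matters for unoptimised or -O builds.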

Make sure it is the same as the Java

One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

import Data.Bits
import Data.Int

-- Transcription of the Java loop: the loop variables become the
-- arguments of the tail-recursive helper go.
loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y        = case () of { _
                | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
            }
            | otherwise     = acc   -- loop finished: return the count

main :: IO ()
main = print $ loop (1 `shiftL` 30)

Peek at the core

A quick peek at the Core, using ghc-core, shows a very nice loop over unboxed types:

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
  \ (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                      (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

that is, all unboxed into registers. That loop looks great!

And it performs just fine (Linux/x86-64/GHC 7.0.3):

./A  5.95s user 0.01s system 99% cpu 5.980 total
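For readers not used to Core, the worker shown above can be written by hand in source Haskell. This is a sketch for illustration only, under some assumptions: it needs a GHC recent enough (7.8 or later) that the primitive comparisons return Int# and are wrapped in isTrue#, the name loopUnboxed is made up, and the limit is passed as an argument instead of being inlined as the constant 2147483646. The argument order (norm2, acc, y, x) follows the Core.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, isTrue#, (*#), (+#), (-#), (<#), (<=#), (>#))

-- Hand-written counterpart of the Core worker: every argument and the
-- result are raw machine integers (Int#), with no boxes to allocate.
go :: Int# -> Int# -> Int# -> Int# -> Int# -> Int#
go limit norm2 acc y x
  | isTrue# (x <=# y) =
      if isTrue# (norm2 <# 0#)
        then go limit ((norm2 +# (2# *# x)) +# 1#) acc y (x +# 1#)
        else if isTrue# (norm2 ># limit)
          then go limit ((norm2 +# 2#) -# (2# *# (x +# y))) acc (y -# 1#) (x -# 1#)
          else go limit ((norm2 +# (2# *# x)) +# 1#) (acc +# 1#) y (x +# 1#)
  | otherwise = acc

-- Boxed wrapper: unpack the Int once, run the loop, box the result once.
loopUnboxed :: Int -> Int
loopUnboxed (I# n) = I# (go (2# *# (n -# 1#)) 0# 0# (n -# 1#) 0#)

main :: IO ()
main = print (loopUnboxed 1024)
```

You would not normally write this yourself: the point of the Core dump is that GHC derives this form from the ordinary boxed definition.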

Checking the asm

We get reasonable assembly too, as a nice loop:

Main_mainzuzdszdwgo_info:
        cmpq    %rdi, %r8
        jg      .L8
.L3:
        testq   %r14, %r14
        movq    %r14, %rdx
        js      .L4
        cmpq    $2147483646, %r14
        jle     .L9
.L5:
        leaq    (%rdi,%r8), %r10
        addq    $2, %rdx
        leaq    -1(%rdi), %rdi
        addq    %r10, %r10
        movq    %rdx, %r14
        leaq    -1(%r8), %r8
        subq    %r10, %r14
        jmp     Main_mainzuzdszdwgo_info
.L9:
        leaq    1(%r14,%r8,2), %r14
        addq    $1, %rsi
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info
.L8:
        movq    %rsi, %rbx
        jmp     *0(%rbp)
.L4:
        leaq    1(%r14,%r8,2), %r14
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info

(This assembly comes from the -fvia-C backend.)

So this looks fine!

My suspicion, as mentioned in the comment above, is that the version of libgmp you have on 32 bit Windows is generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, then try some of the other code generator backends. If you still have an issue with Int64, file a bug report on the GHC Trac.

To broadly confirm that the cost really is those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine. Sure enough, the runtime goes from 3s to well over a minute.

Lesson: use hardware 64 bits if at all possible.
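To repeat the Int64-versus-Integer comparison on your own machine, one option is to make the loop generic in its integer type and run it at each type in turn. This is a sketch, not the code that was actually timed: the name loopAt, the SPECIALIZE pragmas and the smaller input size are all assumptions made for illustration.

```haskell
import Data.Int (Int64)

-- Same control flow as the transcription above, but generic in the
-- integer type, so Int, Int64 and Integer can be swapped in.
loopAt :: Integral a => a -> a
loopAt n = go 0 (n - 1) 0 0
  where
    limit = 2 * (n - 1)
    go x y acc norm2
      | x > y         = acc
      | norm2 < 0     = go (x + 1) y       acc       (norm2 + 2 * x + 1)
      | norm2 > limit = go (x - 1) (y - 1) acc       (norm2 + 2 - 2 * (x + y))
      | otherwise     = go (x + 1) y       (acc + 1) (norm2 + 2 * x + 1)

-- Ask GHC for a dedicated copy per type, so the comparison measures
-- the integer representation and not type-class dictionary passing.
{-# SPECIALIZE loopAt :: Int -> Int #-}
{-# SPECIALIZE loopAt :: Int64 -> Int64 #-}
{-# SPECIALIZE loopAt :: Integer -> Integer #-}

main :: IO ()
main = do
  -- A smaller input than 1 `shiftL` 30, so the Integer run stays short.
  print (loopAt (1048576 :: Int))
  print (loopAt (1048576 :: Int64))
  print (loopAt (1048576 :: Integer))
```

Timing each call separately (for example by building one executable per type) should show whether Int64 behaves like Int or more like Integer on a given platform.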
