Haskell math performance on multiply-add operation

南方客 2021-01-30 18:05

I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one parti

2 answers
  •  轮回少年
    2021-01-30 18:23

    Well, this is better. 3.5s instead of 14s.

    {-# LANGUAGE BangPatterns #-}
    {-
    
    -- multiply-add of four floats,
    Vec4f multiplier, addend;
    Vec4f vecList[];
    for (int i = 0; i < count; i++)
        vecList[i] = vecList[i] * multiplier + addend;
    
    -}
    
    import qualified Data.Vector.Storable as V
    import Data.Vector.Storable (Vector)
    import Data.Bits
    
    repCount, arraySize :: Int
    repCount = 10000
    arraySize = 20000
    
    a, m :: Vector Float
    a = V.fromList [0.2,  0.1, 0.6, 1.0]
    m = V.fromList [0.99, 0.7, 0.8, 0.6]
    
    multAdd :: Int -> Float -> Float
    multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))
    
    go :: Int -> Vector Float -> Vector Float
    go n s
        | n <= 0    = s
        | otherwise = go (n-1) (f s)
      where
        f = V.imap multAdd
    
    main = print . V.sum $ go repCount v
      where
        v :: Vector Float
        v = V.replicate (arraySize * 4) 0
                -- ^ a flattened Vec4f []
    

    Which is better than it was:

    $ ghc -O2 --make A.hs
    [1 of 1] Compiling Main             ( A.hs, A.o )
    Linking A ...
    
    $ time ./A
    516748.13
    ./A  3.58s user 0.01s system 99% cpu 3.593 total
    

    The Core generated for multAdd looks fine:

            case readFloatOffAddr#
                   rb_aVn
                   (word2Int#
                      (and# (int2Word# sc1_s1Yx) __word 3))
                   realWorld#
            of _ { (# s25_X1Tb, x4_X1Te #) ->
            case readFloatOffAddr#
                   rb11_X118
                   (word2Int#
                      (and# (int2Word# sc1_s1Yx) __word 3))
                   realWorld#
            of _ { (# s26_X1WO, x5_X20B #) ->
            case writeFloatOffAddr#
                   @ RealWorld
                   a17_s1Oe
                   sc3_s1Yz
                   (plusFloat#
                      (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)
    

    However, the C code multiplies four elements at a time, so we'd need to do that directly rather than faking it by looping and masking. GCC is probably unrolling the loop, too.

    So to get identical performance, we'd need a vector multiply (a bit hard, possibly via the LLVM backend) and an unrolled loop (possibly fused). I'll defer to Roman here to see if there are other obvious things.

    One idea might be to actually use a Vector Vec4, rather than flattening it.
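
    As a rough illustration of that last idea, here is a minimal sketch that stores the data as a Vector of four-float records instead of a flattened Vector Float. The Vec4 type, its Storable instance, and sumComponents are hypothetical names written for this example; they are not part of the vector library or of the code above, and this has not been benchmarked against the flattened version. The sizes are deliberately small so it runs quickly.

    ```haskell
    -- Sketch only: Vec4, its Storable instance and sumComponents are
    -- hypothetical helpers for illustration, not library definitions.
    import qualified Data.Vector.Storable as V
    import Foreign.Ptr (Ptr, castPtr)
    import Foreign.Storable (Storable (..))

    -- One element of the un-flattened array: four strict, unpacked floats.
    data Vec4 = Vec4 {-# UNPACK #-} !Float {-# UNPACK #-} !Float
                     {-# UNPACK #-} !Float {-# UNPACK #-} !Float
      deriving (Eq, Show)

    -- Lay a Vec4 out as four consecutive Floats, like the C Vec4f.
    instance Storable Vec4 where
      sizeOf _    = 4 * sizeOf (undefined :: Float)
      alignment _ = alignment (undefined :: Float)
      peek p = do
        let q = castPtr p :: Ptr Float
        x <- peekElemOff q 0
        y <- peekElemOff q 1
        z <- peekElemOff q 2
        w <- peekElemOff q 3
        return (Vec4 x y z w)
      poke p (Vec4 x y z w) = do
        let q = castPtr p :: Ptr Float
        pokeElemOff q 0 x
        pokeElemOff q 1 y
        pokeElemOff q 2 z
        pokeElemOff q 3 w

    -- Small sizes so the sketch runs quickly; the answer above used
    -- repCount = 10000 and arraySize = 20000.
    repCount, arraySize :: Int
    repCount  = 100
    arraySize = 1000

    a, m :: Vec4
    a = Vec4 0.2  0.1 0.6 1.0
    m = Vec4 0.99 0.7 0.8 0.6

    -- Component-wise multiply-add on a whole Vec4: no index masking needed.
    multAdd :: Vec4 -> Vec4 -> Vec4 -> Vec4
    multAdd (Vec4 mx my mz mw) (Vec4 ax ay az aw) (Vec4 x y z w) =
      Vec4 (x * mx + ax) (y * my + ay) (z * mz + az) (w * mw + aw)

    -- Apply the multiply-add to every element, n times.
    go :: Int -> V.Vector Vec4 -> V.Vector Vec4
    go n s
      | n <= 0    = s
      | otherwise = go (n - 1) (V.map (multAdd m a) s)

    -- Sum all components, in the same order as V.sum on the flat vector.
    sumComponents :: V.Vector Vec4 -> Float
    sumComponents = V.foldl' (\acc (Vec4 x y z w) -> acc + x + y + z + w) 0

    main :: IO ()
    main = print . sumComponents $ go repCount v
      where
        v = V.replicate arraySize (Vec4 0 0 0 0)
    ```

    Whether GHC turns the four scalar operations in multAdd into actual SIMD instructions depends on the backend and version, so treat this as a way to remove the masking and indexing overhead rather than a guaranteed match for the C loop.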
