Question
After a few hours of debugging, I realized that a very simple toy example was not efficient due to a missing ! in the expression return $ 1 + x (thanks duplode!... but how come GHC does not optimize that??). I also only noticed it because I was comparing it with Python code that was quicker, but I won't always write Python code to benchmark my code...

So here is my question: is there a way to automatically detect these "lazy memory leaks" that slow down a program for no real reason? I'm still pretty bad at optimizing Haskell code, and forgetting a ! is quite likely, even when you're experienced, I guess.
I'm aware of:
- the +RTS -s runtime option, but I'm not sure how to interpret it: seeing 79MB of memory for a simple program seems huge to me, for example, but maybe it's not, since that's what my current program takes... and for bigger programs it's impossible to just detect "lazy leaks" that way, I guess, as I have no idea how much memory my program should take.
- the cabal v2-run --enable-profiling mysatsolvers -- +RTS -p command, but it seems that enabling the profiler kills some optimisations done by GHC, and therefore it's hard to use these values for a real benchmark. And still, it's not clear to me how to find leaks from that output anyway.

Could you, for example, explain how I could find the "lazy leaks" in a toy program like this one?
{-# LANGUAGE DerivingVia, FlexibleInstances, ScopedTypeVariables #-}
module Main where

--- It depends on the transformers, containers, and base packages.
--- Optimisation seems to be important or the NoLog case will be way too long.
--- $ ghc -O Main.hs

import qualified Data.Map.Strict as MapStrict
import Data.Functor.Identity
import qualified Control.Monad as CM
import qualified Control.Monad.State.Strict as State
import qualified Data.Time as Time

-- Create a class that allows me to use the function "myTell"
-- that adds a number in the writer (either the LogEntry
-- or StupidLogEntry one)
class Monad m => LogFunctionCalls m where
  myTell :: String -> Int -> m ()

---------- Logging disabled ----------
--- (No logging at all gives the same time so I don't put it here)
newtype NoLog a = NoLog { unNoLog :: a }
  deriving (Functor, Applicative, Monad) via Identity

instance LogFunctionCalls NoLog where
  myTell _ _ = pure ()

---------- Logging with Map ----------
-- When logging, associate a number to each name.
newtype LogEntryMap = LogEntryMap (MapStrict.Map String Int)
  deriving (Eq, Show)

instance LogFunctionCalls (State.State LogEntryMap) where
  myTell namefunction n = State.modify' $
    \(LogEntryMap m) ->
      LogEntryMap $ MapStrict.insertWith (+) namefunction n m

---------- Logging with Int ----------
-- Don't use any Map to avoid inefficiency of Map
newtype LogEntryInt = LogEntryInt Int
  deriving (Eq, Show)

instance LogFunctionCalls (State.State LogEntryInt) where
  myTell namefunction n = State.modify' $
    \(LogEntryInt m) -> LogEntryInt $! m + n

---------- Function to compute ----------
countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  myTell "countNumberCalls" 1
  x <- countNumberCalls $! n - 1
  return $ 1 + x

main :: IO ()
main = do
  let www = 15000000
  putStrLn $ "Let's start!"
  --- Logging disabled
  t0 <- Time.getCurrentTime
  let n = unNoLog $ countNumberCalls www
  putStrLn $ "Logging disabled: " ++ (show n)
  t1 <- Time.getCurrentTime
  print (Time.diffUTCTime t1 t0)
  -- Logging with Map
  let (n, LogEntryMap log) = State.runState (countNumberCalls www) (LogEntryMap MapStrict.empty)
  putStrLn $ "Logging with Map: " ++ (show n)
  putStrLn $ (show $ log)
  t2 <- Time.getCurrentTime
  print (Time.diffUTCTime t2 t1)
  -- Logging with Int
  let (n, LogEntryInt log) = State.runState (countNumberCalls www) (LogEntryInt 0)
  putStrLn $ "Logging with Int: " ++ (show n)
  putStrLn $ (show $ log)
  t3 <- Time.getCurrentTime
  print (Time.diffUTCTime t3 t2)
Answer 1:
The main method for detecting memory leaks is heap profiling. Specifically, you're looking for unexpected growth in the amount of resident (mostly heap) memory: either the maximum residency in the +RTS -s statistics output, or -- more reliably -- a characteristic "pyramid" shape over time in heap profile output generated with the +RTS -h<x> flags and the hp2ps tool.

If I run your toy program with +RTS -s, I see:
3,281,896,520 bytes allocated in the heap
3,383,195,568 bytes copied during GC
599,346,304 bytes maximum residency (17 sample(s))
5,706,584 bytes maximum slop
571 MB total memory in use (0 MB lost due to fragmentation)
The first line can generally be ignored. Haskell programs typically allocate a roughly constant amount of memory per second of runtime, and this allocation rate is either nearly zero (for certain, unusual programs), or 0.5-2.0 gigabytes per second. This program ran for 4 seconds and allocated 3.8 gigabytes, and that's not unusual.
The bytes copied during GC and maximum residency are concerning, though. Assuming you have a program that you expect to run in constant space (i.e., there's no ever-growing data structure whose entire contents are needed), a correctly functioning Haskell program will generally not need to copy much data during garbage collection, and its maximum residency will be a small fraction of the total bytes allocated (e.g., 100 kilobytes rather than half a gigabyte). Neither figure should grow substantially with the number of "iterations" of whatever it is you're testing.
You can generate a quick heap profile over time without turning on formal profiling. If you compile with the GHC flag -rtsopts, you can use:
./Toy +RTS -hT
and then display the result graphically using the hp2ps tool:
hp2ps -c -e8in Toy.hp
evince Toy.ps &
This sort of pyramid pattern in the resulting graph is a red flag: a rapid linear increase in heap to the tune of hundreds of megabytes per second, followed by a rapid linear collapse. This is the pattern you see when a huge lazy data structure is being needlessly built up before the entire computation is forced all at once. You see two pyramids here because both your second and third tests are exhibiting memory leaks.
As an aside, the x-axis is in "MUT seconds" (seconds the "mutator" is running, which excludes garbage collection), so that's why this is less than the actual 4 second runtime. That's actually another red flag. A Haskell program that's spending half its time garbage collecting probably isn't running correctly.
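To make the shape concrete, here is a minimal, self-contained example (mine, not your program) of the kind of code that produces it. For this particular toy, compiling with -O lets GHC's strictness analysis fix the lazy version automatically, so to actually see the pyramid you'd build it without optimisation; real leaks, like the ones in your program, are the ones the analyser can't see through.

import Data.List (foldl')

-- Lazy left fold: the accumulator is never forced during the loop, so the
-- traversal builds a chain of (+) thunks proportional to the list length.
-- The chain is only collapsed when the final result is demanded, so heap
-- ramps up linearly and then drops -- the pyramid in the -hT profile.
lazySum :: [Int] -> Int
lazySum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (acc + x) xs

-- Strict left fold: the accumulator is forced at every step, so residency
-- stays small and flat.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = do
  print (lazySum   [1 .. 10000000])
  print (strictSum [1 .. 10000000])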
To get more detail on what's causing this heap pyramid, you'll need to compile with profiling enabled. Profiling may cause a program to run somewhat slower, but it doesn't normally change which optimizations are in place. However, the flag -fprof-auto (and related flags), which automatically inserts cost centers, can cause big performance changes (by interfering with inlining, etc.). Unfortunately, the cabal --enable-profiling flag turns on profiling (compiler flag -prof) together with the flag -fprof-auto-top, which automatically generates cost centers for top-level functions, so for your toy example that substantially changes the behavior of your first test case (increasing the runtime from 0.4 seconds to 5 seconds, even with no +RTS flags). That may be the problem you're seeing with profiling affecting your results. You don't need any cost centers for several additional kinds of heap profiles, so you can add the cabal flag --profiling-detail=none to shut that off; then your profiled program should run a little slower but with performance generally similar to the unprofiled version.
I don't use Cabal, but compiling with the following (which should be the equivalent of --enable-profiling --profiling-detail=none):
ghc -O2 -rtsopts -prof Toy.hs # no -fprof-auto...
I can run your program with profiling by data type:
./Toy +RTS -hy
If I look at the resulting heap profile graph, it attributes most of the heap to the Int type. This narrows my problem down to a bunch of unevaluated lazy Int calculations, which might point me in the right direction.
If I'm really having trouble narrowing things down and am feeling like a technical deep-dive, I can also run a heap profile by closure (flag -hd). This tells me that the culprits are Main.sat_s7mQ and Main.sat_s7kP for the two pyramids respectively. This looks very mysterious, but they're the names of functions in the "STG", a low-level intermediate representation of my program generated by the compiler.
If I recompile with the same flags but add -fforce-recomp -ddump-stg -dsuppress-all:
ghc -O2 -rtsopts -prof -fforce-recomp -ddump-stg -dsuppress-all Toy.hs
this will dump the STG that contains the definitions of these two functions. (The generated identifiers can differ with small changes to code and/or compiler flags, so it's best to recompile with the STG dumped and then re-profile that executable, to make sure the identifiers match.)
If I search the STG for the first culprit, I find the definition:
sat_s7mQ =
    CCCS \u []
        case ww2_s7mL of {
          I# y_s7mO ->
              case +# [1# y_s7mO] of sat_s7mP {
                __DEFAULT -> I# [sat_s7mP];
              };
        };
Yes, this is all very technical, but this is STG-speak for the expression 1 + y, which would help me zero in on the culprit.
If you don't speak STG, you can try introducing some cost centers. For example, I tried profiling only your second test case with -fprof-auto (Cabal flag --profiling-detail=all-functions). The profile output in Toy.prof isn't that useful for memory leaks, because it deals with total allocation instead of active (i.e., resident and not garbage-collected) allocations over time, but you can create a heap profile by cost center by running:
./Toy +RTS -hc
In this case, it attributes everything to a single cost center, namely (315)countNumberCalls. The "315" is the cost center number, which you can look up in the Toy.prof file to find the exact source code lines, if it's not clear from the name. Anyway, this at least helps narrow down the problem to countNumberCalls.
For more complicated functions, you can sometimes narrow down the problem further by manually specifying cost centers, like so:
countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  {-# SCC "mytell_call" #-} myTell "countNumberCalls" 1
  x <- {-# SCC "recursive_call" #-} countNumberCalls $! n - 1
  {-# SCC "return_statment" #-} return $ {-# SCC "one_plus_x" #-} 1 + x
This actually attributes everything to "recursive_call", so it's not that helpful.
It's not wrong, though. You actually have two memory leaks here -- the x <- countNumberCalls $! n - 1 leaks heap because x isn't forced, and the 1 + x leaks stack. You could enable the BangPatterns extension and write:

!x <- countNumberCalls $! n - 1

and that would actually remove one of the memory leaks, speeding up the second case from 2.5 seconds to 1.0 seconds and dropping the maximum residency from 460 megs to 95 megs (and the bytes copied during GC from 1.5 gigs to 73 kilobytes!). However, a heap profile would show linearly growing stack accounting for pretty much all of that resident memory. Because stack isn't as well tracked as heap, that would be more difficult to track down.
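For completeness, here is a sketch (my rewrite, not something from your post, and the name countNumberCallsAcc is mine) of how you could avoid both leaks at once: thread a strict accumulator down the recursion instead of building 1 + (1 + ...) on the way back out.

-- Assumes the LogFunctionCalls class from the question is in scope, and
-- that BangPatterns is added to the module's LANGUAGE pragma.
countNumberCallsAcc :: (LogFunctionCalls m) => Int -> m Int
countNumberCallsAcc = go 0
  where
    -- The bang forces the running count at every step, so no chain of
    -- "+1" thunks builds up on the heap, and there is nothing left to
    -- add on the way back out, so the stack stays flat too.
    go !acc 0 = return acc
    go !acc n = do
      myTell "countNumberCalls" 1
      go (acc + 1) $! n - 1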
Answer 2:
You ask:

    return $ 1 + x [...] but how come ghc does not optimize that??
The answer is that strict evaluation and lazy evaluation have subtly different semantics, so having GHC optimise it might break your program.
The difference lies in the treatment of undefined values. Any attempt to evaluate an undefined throws an exception. In GHCi:
Prelude> undefined
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
undefined, called at <interactive>:1:1 in interactive:Ghci1
If I have an expression that contains an undefined then the same thing happens:
Prelude> 2 + undefined
*** Exception: Prelude.undefined [...]
However if the evaluation never gets to the undefined then everything is fine:
Prelude> True || undefined
True
Haskell uses "non-strict semantics" and "lazy evaluation". Technically the non-strict semantics are part of the definition of Haskell and lazy evaluation is the implementation mechanism in GHC, but you can think of them as synonyms. When you define a variable the value is not computed immediately, so if you never use the variable then you have no problem:
Prelude> let b = undefined
Prelude> b
*** Exception: Prelude.undefined
The let works fine, but evaluating the variable it defines throws an exception.
Now consider your towering stack of unevaluated 1+ calls. GHC has no way of knowing in advance whether you are ever going to use the result (see below), and it also has no way of knowing whether or not there is an exception lurking in there somewhere. As a programmer, you might know that there is an exception and carefully not look at the result, relying on the non-strict semantics of Haskell. If GHC prematurely evaluated it and hit an exception, your program would fail when it should not have.
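Here is a tiny example (mine, not part of the original question) of a program that only works because an expression like 1 + ... is left unevaluated:

-- The first component of the pair contains an error, but the program never
-- looks at it, so under lazy evaluation it prints "n was 42" and exits
-- normally.  If GHC eagerly forced both components, it would crash instead.
pairWithCount :: Int -> (Int, String)
pairWithCount n = (1 + error "never evaluated", "n was " ++ show n)

main :: IO ()
main = putStrLn (snd (pairWithCount 42))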
Actually, the GHC compiler includes a piece of optimisation called the Demand Analyser (it used to be called the Strictness Analyser), which looks for opportunities to optimise in exactly the way you want. However, it has limits: it can only optimise a computation when it can prove that the result is going to be evaluated.
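For instance (again my illustration, and the details depend on GHC version and flags), the analyser usually does handle an accumulator like the one below when compiling with -O, because it can prove the accumulator always ends up in a result that is demanded; compiled without -O, the same code builds a long chain of thunks:

-- A textbook case for strictness analysis: `go` is semantically strict in
-- `acc` (whichever branch is taken, acc ends up in the result), so with -O
-- GHC typically evaluates it eagerly and the loop runs in constant space,
-- even without any bang patterns.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 10000000)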
Another wrinkle here is that you have used the State monad, which comes in two variants: Lazy and Strict. The Strict variant forces the result of each step as the computation proceeds, but note that even it leaves the state value itself as a thunk unless you force it yourself with something like modify' (as your code does); the Lazy variant (the default) defers even that step-by-step sequencing.
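A small sketch of the practical difference (my example; the exact space behaviour can vary with GHC version and optimisation level):

import qualified Control.Monad.State.Strict as S
import Control.Monad (replicateM_)

-- modify' forces the new state at each step, so the count stays evaluated
-- and residency stays small.
countForced :: Int -> Int
countForced n = S.execState (replicateM_ n (S.modify' (+ 1))) 0

-- Plain modify builds a thunk per step even in the Strict variant; the
-- chain of (+1) thunks is typically only collapsed when the final state is
-- printed, which can cost a lot of memory.
countThunked :: Int -> Int
countThunked n = S.execState (replicateM_ n (S.modify (+ 1))) 0

main :: IO ()
main = do
  print (countForced 10000000)
  print (countThunked 10000000)

This is also why your use of modify' in the Map and Int loggers is the right instinct; the remaining leaks are in countNumberCalls itself, as discussed in the other answer.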
Source: https://stackoverflow.com/questions/61666819/haskell-how-to-detect-lazy-memory-leaks