GHC's RTS options for garbage collection

独自空忆成欢 提交于 2019-11-27 17:01:42

Generally speaking, garbage collection is a space/time tradeoff. Give the GC more space, and it will take less time. There are (many) other factors in play, cache in particular, but the space/time tradeoff is the most important one.

The tradeoff works like this: the program allocates memory until it reaches some limit (decided by the GC's automatic tuning parameters, or explicitly via RTS options). When the limit is reached, the GC traces all the data that the program is currently using, and reclaims all the memory used by data that is no longer required. The longer you can delay this process, the more data will have become unreachable ("dead") in the meantime, so the GC avoids tracing that data. The only way to delay a GC is to make more memory available to allocate into; hence more memory equals fewer GCs, equals lower GC overhead. Roughly speaking, GHC's -H option lets you set a lower bound on the amount of memory used by the GC, so can lower the GC overhead.

GHC uses a generational GC, which is an optimisation to the basic scheme, in which the heap is divided into two or more generations. Objects are allocated into the "young" generation, and the ones that live long enough get promoted into the "old" generation (in a 2-generation setup). The young generation is collected much more frequently than the old generation, the idea being that "most objects die young", so young-generation collections are cheap (they don't trace much data), but they reclaim a lot of memory. Roughly speaking, the -A option sets the size of the young generation - that is, the amount of memory that will be allocated before the young generation is collected.

The default for -A is 512k: it's a good idea to keep the young generation smaller than the L2 cache, and performance generally drops if you exceed the L2 cache size. But working in the opposite direction is the GC space/time tradeoff: using a very large young generation size might outweigh the cache benefits by reducing the amount of work the GC has to do. This doesn't always happen, it depends on the dynamics of the application, which makes it hard for the GC to automatically tune itself. The -H option also increases the young generation size, so can also have an adverse effect on cache usage.

The bottom line is: play around with the options and see what works. If you have plenty of memory to spare, you might well be able to get a performance increase by using either -A or -H, but not necessarily.

It is probably possible to reproduce the issue for smaller data sets, where it will be easier to see what is going on. In particular, I suggest to gain some familiarity with profiling:

Then, check whether the memory profiles you see match your expectations (you don't need to know about all the profiling options to get useful graphs). The combination of strict foldl' with a non-strict tuple as the accumulator would be the first thing I'd look at: if the components of the tuple are not forced, that recursion is building up unevaluated expressions.

Btw, you can build up a good intuition about such things by trying to evaluate your code by hand for really tiny data sets. A few iterations will suffice to see whether expressions get evaluated or remain unevaluated in the way that fits your application.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!