Why does deep usage of the stack cause superlinear time behavior for a simple interpreter?

Question

type Expr =
    | Lit of int
    | Add of Expr * Expr

let rec intr = function
    | Lit _ as x -> x
    | Add(Lit a,Lit b) -> Lit <| a + b
    | Add(a,b) -> intr <| Add(intr a, intr b)

let rec intr_cps x ret =
    match x with
    | Lit _ as x -> ret x
    | Add(Lit a,Lit b) -> Lit (a + b) |> ret
    | Add(a,b) -> 
        intr_cps a <| fun a ->
            intr_cps b <| fun b ->
                intr_cps (Add(a, b)) ret

let rec add n =
    if n > 1 then Add(Lit 1, add (n-1))
    else Lit 1

open System.Threading

let mem = 1024*1024*512 // 512 MiB (about 537 MB)
// The interpreters overflow the stack unless they run on a separate thread with a large stack.
// By default, the program only has a few MB of stack memory at its disposal.
let run f = Thread(ThreadStart f,mem).Start() 

run <| fun _ ->
    let f n =
        let x = add n
        let stopwatch = System.Diagnostics.Stopwatch.StartNew()
        printfn "%A" (intr x)
        printfn "n_%i_std = %A" n stopwatch.Elapsed

        stopwatch.Restart()
        printfn "%A" (intr_cps x id)
        printfn "n_%i_cps = %A" n stopwatch.Elapsed
    f <| 1000*1000/2
    f <| 1000*1000
    f <| 1000*1000*2

//Lit 500000
//n_500000_std = 00:00:00.7764730
//Lit 500000
//n_500000_cps = 00:00:00.0800371
//Lit 1000000
//n_1000000_std = 00:00:02.9531043
//Lit 1000000
//n_1000000_cps = 00:00:00.1941828
//Lit 2000000
//n_2000000_std = 00:00:13.7823780
//Lit 2000000
//n_2000000_cps = 00:00:00.2767752

I have a much bigger interpreter whose performance I am trying to understand better, so I wrote the example above. I am now sure that the superlinear time scaling I see in it on some examples is related to the way it uses the stack, but I do not know why this happens, so I wanted to ask here.

As you can see, when I double n, the running time grows by much more than 2x, and the scaling looks exponential, which surprises me. It is also surprising that the CPS'd interpreter is faster than the stack-based one. Why is that?
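
A quick way to put a number on the scaling, as a rough sketch: if the running time behaves like c * n^k, then k can be estimated from two measurements as log(t2/t1) / log(n2/n1). The snippet below applies that to the three n_*_std timings printed above. The names `timings` and `exponents` are made up for this illustration. With these three data points the estimates come out close to 2, which would suggest roughly quadratic rather than exponential growth, although three samples are too few to be sure.

```fsharp
// Timings in seconds, copied from the n_*_std lines in the output above.
let timings =
    [ (500000, 0.7764730)
      (1000000, 2.9531043)
      (2000000, 13.7823780) ]

// Assumption: t(n) ~ c * n^k, so k = log(t2 / t1) / log(n2 / n1)
// for each pair of consecutive measurements.
let exponents =
    timings
    |> List.pairwise
    |> List.map (fun ((n1, t1), (n2, t2)) ->
        log (t2 / t1) / log (float n2 / float n1))

exponents |> List.iter (printfn "estimated exponent k = %.2f")
```

For comparison, truly exponential growth would make the estimated k keep rising sharply as n grows instead of staying near a constant.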

Would I see the same behavior if I wrote the equivalent of the above in a non-.NET language, or even in C?

Answer 1:

Looks like most of what you're measuring is building the data structure. Factor out

let data = add n

outside the time measurement (and replace add n with data inside) and the CPS goes linear.
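
A minimal sketch of that separation, assuming a small hypothetical helper named `time` (not part of the question's code) that wraps only the call being measured. The value of n is kept small here so the snippet runs on a default-sized stack; the sizes from the question need the large-stack thread shown there.

```fsharp
open System.Diagnostics

type Expr =
    | Lit of int
    | Add of Expr * Expr

let rec intr_cps x ret =
    match x with
    | Lit _ as x -> ret x
    | Add(Lit a, Lit b) -> ret (Lit (a + b))
    | Add(a, b) ->
        intr_cps a (fun a ->
            intr_cps b (fun b ->
                intr_cps (Add(a, b)) ret))

let rec add n =
    if n > 1 then Add(Lit 1, add (n - 1))
    else Lit 1

// Hypothetical helper: starts the stopwatch right before f runs and stops it
// right after, so nothing else is included in the measurement.
let time label f =
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    printfn "%s = %A" label sw.Elapsed
    result

// Small n so the sketch does not need an enlarged stack.
let n = 1000
// The expression tree is built before any timing starts.
let data = add n
let result = time (sprintf "n_%i_cps" n) (fun () -> intr_cps data id)
printfn "%A" result
```

Passing the work as a function keeps the timed region explicit, which makes it harder to accidentally time the construction of `data` as well.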

I don't know enough about threads with large stacks and memory performance to explain the rest offhand, and haven't profiled the memory to get any feel.

Answer 2:

I did some detective work and can answer that the reason for the excessively long running times of the stack-based interpreter is the GC. The first thing I tried was compiling the program in 32-bit mode, and I was surprised to get these timings:

Lit 500000
n_500000_std = 00:00:00.3964533
Lit 500000
n_500000_cps = 00:00:00.0945109
Lit 1000000
n_1000000_std = 00:00:01.6021848
Lit 1000000
n_1000000_cps = 00:00:00.2143892
Lit 2000000
n_2000000_std = 00:00:08.0540017
Lit 2000000
n_2000000_cps = 00:00:00.3823931

As you can see, the stack-based interpreter is about 2x faster than in 64-bit mode. I then removed the CPS'd interpreter from the benchmark and ran the program under the PerfView tool. My initial hypothesis was that the excessive running times are caused by the GC.

CommandLine: "IntepreterBenchmark.exe"
Runtime Version: V 4.0.30319.0 (built on 6/6/2017 10:30:00 PM)
CLR Startup Flags: CONCURRENT_GC
Total CPU Time: 19,306 msec
Total GC CPU Time: 17,436 msec
Total Allocs : 202.696 MB
GC CPU MSec/MB Alloc : 86.020 MSec/MB
Total GC Pause: 17,421.9 msec
% Time paused for Garbage Collection: 90.2%
% CPU Time spent Garbage Collecting: 90.3%

The hypothesis was in fact correct. I read that the GC has to walk the stack before every collection, and that this has strong implications for how programs should be structured in .NET, but I do not understand the GC well enough to comment on why dependencies between datatypes are left alone.
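
For a rough in-process check without PerfView, one option is to count collections around the interpreter call with `System.GC.CollectionCount`. This is only a sketch: the helper `measureGc` is hypothetical, it reports how many collections ran per generation rather than the pause times PerfView shows, and n is kept small so it runs on a default-sized stack (with so little allocation the counts may well all be zero).

```fsharp
open System
open System.Diagnostics

type Expr =
    | Lit of int
    | Add of Expr * Expr

let rec intr = function
    | Lit _ as x -> x
    | Add(Lit a, Lit b) -> Lit (a + b)
    | Add(a, b) -> intr (Add(intr a, intr b))

let rec add n =
    if n > 1 then Add(Lit 1, add (n - 1))
    else Lit 1

// Hypothetical helper: runs f and reports the elapsed time together with the
// number of collections per GC generation that happened while f was running.
let measureGc label f =
    let snapshot () =
        [ for g in 0 .. GC.MaxGeneration -> GC.CollectionCount(g) ]
    let before = snapshot ()
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    let counts = List.map2 (fun a b -> a - b) (snapshot ()) before
    printfn "%s: %A, collections per generation = %A" label sw.Elapsed counts
    result, counts

// Small n so the sketch does not need an enlarged stack.
let data = add 1000
let result, counts = measureGc "std" (fun () -> intr data)
printfn "%A" result
```

Running the same helper at the sizes from the question, on the large-stack thread, would show whether the number of collections alone grows with n or whether each collection is getting more expensive.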

The measurement above is for 32-bit mode. Under PerfView, the 64-bit measurements are broken: the run took 15 times as long to finish, for an unknown reason.

I also can't explain why 32-bit mode is 2x faster on the original benchmark since it is not like the stack would be 2x bigger compared to 64-bit mode.

Source: https://stackoverflow.com/questions/45662291/why-does-deep-usage-of-the-stack-cause-superlinear-time-behavior-for-a-simple-in

Tags

.net

memory