Scheme: why is Internal Definition faster than External Definition?

Posted by 会有一股神秘感 on 2019-12-12 15:46:31

Question


I tried running the program below

(define (odd-internal x)
  (define (even x)
    (if (zero? x)
        #t
        (odd-internal (sub1 x))))
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (odd-external x)
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (even x)
  (if (zero? x)
      #t
      (odd-external (sub1 x))))

(begin (display "Using internal definition\n")
       (time (odd-internal 40000000)))

(begin (display "Using external definition\n")
       (time (odd-external 40000000)))

This is the result in Racket

Using internal definition
cpu time: 166 real time: 165 gc time: 0
#f
Using external definition
cpu time: 196 real time: 196 gc time: 0
#f

There you can see that using the internal definition is quite a bit faster. I've tried running it on Chez Scheme and the result is similar. Why is that?


Answer 1:


I was amazed that there was a difference at all, so following the comments on Lexi's answer I split the two versions into their own files, internal.rkt and external.rkt, and compiled and decompiled them like this:

raco make external.rkt
raco decompile compiled/external_rkt.zo

This goes one step further than looking at the fully expanded program in the macro stepper. The output is not very human-readable, so I have prettified it with the most important parts intact:

(define (odd-external x1)
  (if (zero? x1)
      '#f
      (let ((x2 (sub1 x1)))
        (if (zero? x2)
            '#t
            (let ((x3 (sub1 x2)))
              (if (zero? x3)
                  '#f
                  (let ((x4 (sub1 x3)))
                    (if (zero? x4)
                        '#t
                        (let ((x5 (sub1 x4)))
                          (if (zero? x5) '#f (even (sub1 x5))))))))))))

(define (even x1)
  (if (zero? x1)
       '#t
       (let ((x2 (sub1 x1)))
         (if (zero? x2)
           '#f
           (let ((x3 (sub1 x2)))
             (if (zero? x3)
               '#t
               (let ((x4 (sub1 x3)))
                 (if (zero? x4)
                   '#f
                   (let ((x5 (sub1 x4)))
                     (if (zero? x5)
                       '#t
                       (let ((x6 (sub1 x5)))
                         (if (zero? x6)
                           '#f
                           (let ((x7 (sub1 x6)))
                             (if (zero? x7)
                               '#t
                               (odd-external (sub1 x7))))))))))))))))

Nothing special here. The compiler unrolls the loop some number of times and constant-folds. Notice that we still have mutual recursion, and that the unrolling is 5 and 7 iterations deep. The top-level call was constant-folded as well: the compiler replaced my call with (even 399999995), so it had run the first 5 iterations at compile time before giving up. The interesting thing is the internal version:

(define (odd-internal x1)
  (if (zero? x1)
      '#f
      (let ((x2 (sub1 x1)))
        (if (zero? x2)
            '#t
            (let ((x3 (sub1 x2)))
              (if (zero? x3)
                  '#f
                  (let ((x4 (sub1 x3)))
                    (if (zero? x4)
                        '#t
                        (let ((x5 (sub1 x4)))
                          (if (zero? x5)
                              '#f
                              (let ((x6 (sub1 x5)))
                                (if (zero? x6)
                                    '#t
                                    (let ((x7 (sub1 x6)))
                                      (if (zero? x7)
                                          '#f
                                          (let ((x8 (sub1 x7)))
                                            (if (zero? x8)
                                                '#t
                                                (odd-internal
                                                 (sub1 x8))))))))))))))))))

It is no longer mutually recursive, since it calls itself after 8 iterations. Each round does 8 iterations, while the external version alternates between 7 and 5. Over two rounds the internal version has done 16 iterations while the external one has done 12. The initial call in the internal case was folded to (odd-internal '399999992), so the compiler did 8 iterations before giving up.

I guess the code inside the functions at the decompiler level is open-coded, and the work at each step is very cheap, making the number of calls the reason for the 25% speed difference. After all, doing 16 iterations per two calls instead of 12 means roughly 25% fewer calls, which coincides with the difference in computing time. This is speculation based on observation, so it would be interesting to have a comment from Lexi on this.




Answer 2:


Your numbers are too small to be meaningful. The difference between 166 ms and 196 ms is, in absolute terms, tiny. Who knows what other factors could be influencing that? VM warmup time, differences in memory allocation, or any number of other things could easily cause a discrepancy of that size. To be sure, you should make the numbers much bigger.

On my machine, running Racket v7.0, I increased the arguments from 40000000 to 1000000000 and ran the program. The results were 2.361 s for the internal definition case and 2.212 s for the external definition case. Given the sorts of factors listed above, that difference is too small to be meaningful.

Benchmarking is hard, and benchmarking languages that run on VMs and are JIT compiled is harder. Even if you account for warmup and GC, run lots of iterations and take the averages, and generally try to do things right, the results you get could still be nearly meaningless, as the 2017 OOPSLA paper Virtual Machine Warmup Blows Hot and Cold explains:

Virtual Machines (VMs) with Just-In-Time (JIT) compilers are traditionally thought to execute programs in two phases: the initial warmup phase determines which parts of a program would most benefit from dynamic compilation, before JIT compiling those parts into machine code; subsequently the program is said to be at a steady state of peak performance. Measurement methodologies almost always discard data collected during the warmup phase such that reported measurements focus entirely on peak performance. We introduce a fully automated statistical approach, based on changepoint analysis, which allows us to determine if a program has reached a steady state and, if so, whether that represents peak performance or not. Using this, we show that even when run in the most controlled of circumstances, small, deterministic, widely studied microbenchmarks often fail to reach a steady state of peak performance on a variety of common VMs. Repeating our experiment on 3 different machines, we found that at most 43.5% of ⟨VM, benchmark⟩ pairs consistently reach a steady state of peak performance.

Emphasis mine. Make sure you’re measuring what you think you’re measuring.
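To reduce (though not eliminate) some of these effects, one option is to time each version several times and average the results. Here is a minimal sketch, assuming Racket's built-in time-apply; the run count and the single discarded warmup run are my own choices, not part of the original answer:

```scheme
#lang racket

;; Definitions from the question, reproduced so the sketch is self-contained.
(define (odd-external x)
  (if (zero? x) #f (even (sub1 x))))
(define (even x)
  (if (zero? x) #t (odd-external (sub1 x))))

;; Run thunk once as warmup, then average cpu time over `runs` repetitions.
(define (benchmark thunk runs)
  (thunk)                                   ; discard the warmup run
  (define cpu-times
    (for/list ([i (in-range runs)])
      (define-values (results cpu real gc) (time-apply thunk '()))
      cpu))
  (exact->inexact (/ (apply + cpu-times) runs)))

(printf "external: ~a ms average\n"
        (benchmark (lambda () (odd-external 40000000)) 10))
```

Even this only smooths out noise; as the paper above shows, it does not guarantee you are measuring steady-state performance.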




Answer 3:


First off, this will depend on your implementation, since nested defines can be implemented in more than one way. On my Chez Scheme 9.5 setup I get a rather consistent 25% faster run time when I use odd-internal.

Now, for the why. This happens because nested defines (i.e. internal defines) are very different from top-level defines.

When you use define at the top level, you are adding a new record to the free-variables table. Whenever you evaluate a variable that is not bound by any lambda, it is looked up in that free-variables (hash) table. This search is very efficient, but it is still slower than fetching a bound variable. So when you compute (odd-external 40000000), you fetch even and odd-external from that table about 40 million times. Even with caching and other tricks, this is still work to be done.
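One way to see why a top-level call must go through a lookup: the binding can be replaced at any time, so the call site cannot simply be wired to one particular function. A minimal sketch (REPL-style top-level semantics assumed; the names are hypothetical):

```scheme
(define (greet) 'hello)
(define (call-greet) (greet))  ; greet is free here, so it is fetched at call time

(call-greet)                   ; => hello
(define (greet) 'goodbye)      ; rebind greet at the top level
(call-greet)                   ; => goodbye, the new binding is found
```

This is also why the compiler cannot freely inline mutually recursive top-level functions into each other the way it can with internal ones.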

In contrast, nested defines create a bound variable. One way they are implemented is as nested lambda/let/letrec expressions. That way the odd-internal function would be transformed into something like this[1]:

(define (odd-internal x)
  (let ((even (lambda (x)
                (if (zero? x)
                    #t
                    (odd-internal (sub1 x))))))
    (if (zero? x)
        #f
        (even (sub1 x)))))

(Which is a simplification of what Chez Scheme does).
Now, every time you apply odd-internal it is still a free variable, so it is hashed and found in the free-variables table. However, when you apply even, you just grab it from the environment (which can cost as little as a single memory dereference, even without clever tricks).

A fun experiment is to define both odd and even as bound variables, so that all 40 million variable fetches benefit from the quick bound-variable fetch times. I saw a 16% improvement on top of the original 25%. Here's the code:

(define (odd-quick x)
  (define (odd x) (if (zero? x) #f (even (sub1 x))))
  (define (even x) (if (zero? x) #t (odd (sub1 x))))
  (odd x))
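For reference, mutually recursive internal defines like these are commonly described as expanding into a letrec (or letrec*) form, in which both odd and even are bound variables. This is my own sketch of that expansion, not decompiler output:

```scheme
(define (odd-quick x)
  (letrec ((odd  (lambda (x) (if (zero? x) #f (even (sub1 x)))))
           (even (lambda (x) (if (zero? x) #t (odd (sub1 x))))))
    (odd x)))
```

Here every variable reference in the hot loop is lexically bound, so none of them go through the free-variables table.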

[1] let is syntactic sugar for a lambda application, so you can read that code as:

(define (odd-internal x)
  ((lambda (even)
     (if (zero? x)
         #f
         (even (sub1 x))))
   (lambda (x)
     (if (zero? x)
         #t
         (odd-internal (sub1 x))))))


Source: https://stackoverflow.com/questions/52688942/scheme-why-is-internal-definition-faster-than-external-definition
