Why is the second loop over a static array in the BSS faster than the first?

隐瞒了意图╮ 2021-01-13 10:37

I have the following code that writes a global array with zeros twice, once forward and once backward.

#include 
#include 
#inc         


        
3 answers
  •  予麋鹿 (original poster)
    2021-01-13 11:34

    When you define global data in C without an initializer, it is zero-initialized:

    char c[SIZE];
    char c2[SIZE];
    

    In the Linux (Unix) world this means that both c and c2 are placed in a special section of the ELF file, .bss:

    ... data segment containing statically-allocated variables represented solely by zero-valued bits initially

    The .bss section exists so that the binary does not have to store all those zeros; it only records something like "this program wants 200 MB of zeroed memory".

    When your program is loaded, the ELF loader (the kernel for classic static binaries, or the ld.so dynamic loader, also known as the interp) allocates memory for .bss, typically with something like mmap with the MAP_ANONYMOUS flag and a request for read and write permissions.

    But the memory manager in the OS kernel does not hand you 200 MB of zeroed physical memory up front. Instead, it marks that part of your process's virtual memory as zero-initialized, and every page of it points to a special zero page in physical memory. That page holds 4096 zero bytes, so reading from c or c2 returns zeros, and this mechanism lets the kernel cut down on memory use.
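    The zero-fill behaviour described above can be checked directly. The following is a minimal sketch, assuming Linux with glibc; the helper name anon_region_reads_as_zero is made up for illustration, and the mapping only approximates what a loader does for .bss:

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not from the original post): map `len` bytes of
 * anonymous read/write memory, read every byte, and report what was seen.
 * Returns 1 if all bytes read as zero, 0 if any byte was nonzero,
 * and -1 if the mapping could not be created. */
int anon_region_reads_as_zero(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {         /* reads only; nothing is written */
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

    Calling anon_region_reads_as_zero(1 << 20) from main should return 1: the region was never written, yet every byte reads as zero.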

    The mappings to the zero page are special: they are marked read-only in the page table. On the first write to any such virtual page, the hardware (the MMU and TLB, so to speak) raises a page fault exception. The kernel handles this fault, in your case in the minor page fault handler: it allocates one physical page, fills it with zero bytes, and remaps the just-accessed virtual page to that physical page. Then it reruns the faulting instruction.

    I changed your code a bit (each loop is moved into its own function):

    $ cat b.c
    #include <stdio.h>
    #include <time.h>
    #define SIZE 100000000
    
    char c[SIZE];
    char c2[SIZE];
    
    void FIRST()
    {
       int i;
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
    }
    
    void SECOND()
    {
       int i;
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
    }
    
    
    int main()
    {
       int i;
       clock_t t = clock();
       FIRST();
       t = clock() - t;
       printf("%ld\n\n", (long)t);   /* clock_t is not int; cast to match the format */
    
       t = clock(); 
       SECOND();
    
       t = clock() - t;
       printf("%ld\n\n", (long)t);   /* clock_t is not int; cast to match the format */
    }
    

    Compile with gcc b.c -fno-inline -O2 -o b, then run it under Linux's perf stat or the more generic /usr/bin/time to get the page fault count:

    $ perf stat ./b
    139599
    
    93283
    
    
     Performance counter stats for './b':
     ....
                24 550 page-faults               #    0,100 M/sec           
    
    
    $ /usr/bin/time ./b
    234246
    
    92754
    
    Command exited with non-zero status 7
    0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
    0inputs+8outputs (0major+24576minor)pagefaults 0swaps
    

    So we have about 24.5 thousand minor page faults. With the standard 4096-byte page size on x86/x86_64, that is close to 100 megabytes.
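    The arithmetic behind that estimate can be spelled out. This is a small sketch with a made-up helper name, pages_needed, that rounds a byte count up to whole pages:

```c
#include <stddef.h>

/* Hypothetical helper: number of pages of `page_size` bytes needed to
 * cover `bytes`, rounding up so a partial last page still counts. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

    For SIZE = 100000000 and 4096-byte pages, pages_needed gives 24415 pages for the array c. The 24576 minor faults reported by /usr/bin/time correspond to 24576 * 4096 = 100663296 bytes, so the array accounts for nearly all of them. The small surplus is consistent with the loader and libc entries in the profile, though that attribution is an inference rather than something measured here.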

    With the Linux profiler's perf record and perf report, we can find where the page faults occur:

    $ perf record -e page-faults ./b
    ...skip some spam from non-root run of perf...
    213322
    
    97841
    
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]
    
    $ perf report -n |cat
    ...
    # Samples: 467  of event 'page-faults'
    # Event count (approx.): 24583
    #
    # Overhead       Samples  Command      Shared Object                   Symbol
    # ........  ............  .......  .................  .......................
    #
        98.73%           459        b  b                  [.] FIRST              
         0.81%             1        b  libc-2.19.so       [.] __new_exitfn       
         0.35%             1        b  ld-2.19.so         [.] _dl_map_object_deps
         0.07%             1        b  ld-2.19.so         [.] brk                
         ....
    

    So now we can see that only the FIRST function generates page faults (on the first write to each bss page), and SECOND generates none. Every page fault corresponds to some work done by the OS kernel, and that work is done only once per bss page (because bss is never unmapped and mapped back).
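    The same comparison can be made from inside the process, without perf. The sketch below assumes a POSIX system where getrusage reports minor faults in ru_minflt (Linux does). The names DEMO_SIZE, demo_buf, and count_faults_per_pass are made up for illustration, and the buffer is deliberately smaller than the 100 MB array above:

```c
#include <stddef.h>
#include <sys/resource.h>

#define DEMO_SIZE (16u * 1024 * 1024)   /* hypothetical size: 16 MiB */

static char demo_buf[DEMO_SIZE];        /* uninitialized global: lands in .bss */

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Write a zero to every byte; the volatile pointer keeps the compiler
 * from dropping or merging the stores. */
static void zero_pass(void)
{
    volatile char *p = demo_buf;
    for (size_t i = 0; i < DEMO_SIZE; i++)
        p[i] = 0;
}

/* Runs two identical passes over demo_buf and stores the number of minor
 * faults each pass took. Returns 0 on success, -1 if getrusage failed. */
int count_faults_per_pass(long *first, long *second)
{
    long before = minor_faults();
    zero_pass();
    long middle = minor_faults();
    zero_pass();
    long after = minor_faults();

    if (before < 0 || middle < 0 || after < 0)
        return -1;

    *first = middle - before;
    *second = after - middle;
    return 0;
}
```

    On a typical Linux system the first count should be on the order of one fault per touched page and the second close to zero. Exact numbers depend on the kernel configuration (transparent huge pages, for example, can lower the first count considerably). Only the first call in a process is meaningful, because the pages stay mapped afterwards.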
