Linux overcommit heuristic

面向向阳花 2020-12-11 04:43

The overcommit article in the kernel documentation only mentions that overcommit mode 0 is based on heuristic overcommit handling. It does not describe the heuristic itself.

1 answer
  • 2020-12-11 05:09

    Actually, kernel documentation of overcommit accounting has some details: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

    The Linux kernel supports the following overcommit handling modes

    0 - Heuristic overcommit handling.

    Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

    Also Documentation/sysctl/vm.txt

    overcommit_memory: This value contains a flag that enables memory overcommitment.
    When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory...

    See Documentation/vm/overcommit-accounting and mm/mmap.c::__vm_enough_memory() for more information.

    Also, man 5 proc:

    /proc/sys/vm/overcommit_memory This file contains the kernel virtual memory accounting mode. Values are:

                    0: heuristic overcommit (this is the default)
                    1: always overcommit, never check
                    2: always check, never overcommit
    

    In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed".

    So very large allocations are refused by the heuristic, but an application may sometimes allocate more virtual memory than the system has physical memory, as long as it does not actually use all of it. With MAP_NORESERVE the amount of mappable memory may be even higher.
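As a rough illustration of the point above, the sketch below maps anonymous memory with or without MAP_NORESERVE and never touches it. The function name reserve_address_space is made up for this example. Whether a given size succeeds depends on the overcommit mode and on the machine, so no particular outcome is implied for large sizes.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: ask for `len` bytes of anonymous address space
 * and release them again without ever writing to them.
 * Returns 0 if mmap() accepted the request, -1 if it refused. */
int reserve_address_space(size_t len, int noreserve)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    if (noreserve)
        flags |= MAP_NORESERVE;   /* man 5 proc: not checked in mode 0 */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* No page was touched, so no physical memory was consumed. */
    munmap(p, len);
    return 0;
}
```

Calling it with a size well beyond physical RAM, once with noreserve set to 0 and once with 1, is one way to observe the difference on a mode 0 system.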

    The documentation says "The overcommit policy is set via the sysctl `vm.overcommit_memory'", so we can look up how it is implemented in the source code: http://lxr.free-electrons.com/ident?v=4.4;i=sysctl_overcommit_memory. The variable is defined at line 112 of mm/mmap.c:

      112 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic overcommit */
    

    The constant OVERCOMMIT_GUESS (defined in linux/mman.h) is actually used only at line 170 of mm/mmap.c, which is the implementation of the heuristic:

    138 /*
    139  * Check that a process has enough memory to allocate a new virtual
    140  * mapping. 0 means there is enough memory for the allocation to
    141  * succeed and -ENOMEM implies there is not.
    142  *
    143  * We currently support three overcommit policies, which are set via the
    144  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
    145  *
    146  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
    147  * Additional code 2002 Jul 20 by Robert Love.
    148  *
    149  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
    150  *
    151  * Note this is a helper function intended to be used by LSMs which
    152  * wish to use this logic.
    153  */
    154 int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
    ...
    170         if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
    171                 free = global_page_state(NR_FREE_PAGES);
    172                 free += global_page_state(NR_FILE_PAGES);
    173 
    174                 /*
    175                  * shmem pages shouldn't be counted as free in this
    176                  * case, they can't be purged, only swapped out, and
    177                  * that won't affect the overall amount of available
    178                  * memory in the system.
    179                  */
    180                 free -= global_page_state(NR_SHMEM);
    181 
    182                 free += get_nr_swap_pages();
    183 
    184                 /*
    185                  * Any slabs which are created with the
    186                  * SLAB_RECLAIM_ACCOUNT flag claim to have contents
    187                  * which are reclaimable, under pressure.  The dentry
    188                  * cache and most inode caches should fall into this
    189                  */
    190                 free += global_page_state(NR_SLAB_RECLAIMABLE);
    191 
    192                 /*
    193                  * Leave reserved pages. The pages are not for anonymous pages.
    194                  */
    195                 if (free <= totalreserve_pages)
    196                         goto error;
    197                 else
    198                         free -= totalreserve_pages;
    199 
    200                 /*
    201                  * Reserve some for root
    202                  */
    203                 if (!cap_sys_admin)
    204                         free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
    205 
    206                 if (free > pages)
    207                         return 0;
    208 
    209                 goto error;
    210         }
    

    So the heuristic is a way to estimate how many physical memory pages are available right now (free) at the moment a request for more memory is processed (the application asks for pages pages).
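The arithmetic of the quoted mode 0 branch can be replayed in user space. The following is only a sketch: struct mem_snapshot and heuristic_enough_memory are hypothetical names, and the counters are passed in as plain numbers (in pages) instead of being read from the kernel.

```c
#include <errno.h>

/* Hypothetical snapshot of the counters used by the heuristic, in pages. */
struct mem_snapshot {
    long free_pages;          /* NR_FREE_PAGES */
    long file_pages;          /* NR_FILE_PAGES */
    long shmem_pages;         /* NR_SHMEM */
    long swap_free_pages;     /* get_nr_swap_pages() */
    long slab_reclaimable;    /* NR_SLAB_RECLAIMABLE */
    long totalreserve_pages;  /* totalreserve_pages */
    long admin_reserve_pages; /* sysctl_admin_reserve_kbytes, as pages */
};

/* Mirrors the OVERCOMMIT_GUESS branch: 0 if the request fits, -ENOMEM if not. */
int heuristic_enough_memory(const struct mem_snapshot *s, long pages,
                            int cap_sys_admin)
{
    long free = s->free_pages + s->file_pages - s->shmem_pages
              + s->swap_free_pages + s->slab_reclaimable;

    /* Leave the reserved pages alone. */
    if (free <= s->totalreserve_pages)
        return -ENOMEM;
    free -= s->totalreserve_pages;

    /* Keep a little extra back for root. */
    if (!cap_sys_admin)
        free -= s->admin_reserve_pages;

    return free > pages ? 0 : -ENOMEM;
}
```

Note that the comparison is strict, as in the kernel code: a request for exactly free pages is refused.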

    With overcommit always enabled (mode "1"), this function always returns 0 ("there is enough memory for this request"):

    164         /*
    165          * Sometimes we want to use more memory than we have
    166          */
    167         if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
    168                 return 0;
    

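    In mode "2" this default heuristic is not used. Instead the kernel relies on accounting: the requested pages (the pages argument) are added to the committed total, which gives the new Committed_AS (shown in /proc/meminfo). This accounting is done at line 162, before the mode checks: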

    162         vm_acct_memory(pages);
    ...
    

    This is actually just an increment of vm_committed_as: __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
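The committed total can be observed from user space through the Committed_AS line of /proc/meminfo (the CommitLimit line shows the limit). As a minimal sketch, the hypothetical helper meminfo_field_kb below extracts one field from text in that format. It works on a buffer, so how the file is read is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the number from the line "<key>: <n> kB"
 * in meminfo-style text, or -1 if the key is missing or malformed. */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* The key must be followed directly by ':' to match. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;

            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Comparing the Committed_AS and CommitLimit values obtained this way gives a view of the same quantities the strict mode compares.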

    212         allowed = vm_commit_limit();
    

    The limit itself is computed in vm_commit_limit():

    401 /*
    402  * Committed memory limit enforced when OVERCOMMIT_NEVER policy is used
    403  */
    404 unsigned long vm_commit_limit(void)
    405 {
    406         unsigned long allowed;
    407 
    408         if (sysctl_overcommit_kbytes)
    409                 allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
    410         else
    411                 allowed = ((totalram_pages - hugetlb_total_pages())
    412                            * sysctl_overcommit_ratio / 100);
    413         allowed += total_swap_pages;
    414 
    415         return allowed;
    416 }
    417 
    

    So allowed is either the value of the vm.overcommit_kbytes sysctl (given in kilobytes and converted to pages) or vm.overcommit_ratio percent of physical RAM (excluding huge pages). In both cases the total swap size is added.
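A worked example may make the formula concrete. The sketch below repeats the computation of vm_commit_limit() with all inputs passed as parameters. The function name commit_limit_pages and the explicit page_shift parameter are assumptions made for illustration.

```c
/* Hypothetical user-space copy of the vm_commit_limit() arithmetic.
 * All sizes are in pages except overcommit_kbytes. */
unsigned long commit_limit_pages(unsigned long totalram_pages,
                                 unsigned long hugetlb_pages,
                                 unsigned long total_swap_pages,
                                 unsigned long overcommit_kbytes,
                                 unsigned long overcommit_ratio,
                                 unsigned int page_shift)
{
    unsigned long allowed;

    if (overcommit_kbytes)
        /* An absolute limit takes precedence over the ratio. */
        allowed = overcommit_kbytes >> (page_shift - 10);
    else
        allowed = (totalram_pages - hugetlb_pages) * overcommit_ratio / 100;

    /* Swap is added in both cases. */
    allowed += total_swap_pages;
    return allowed;
}
```

For instance, with 4 GiB of RAM (1048576 pages of 4 KiB), no huge pages, 2 GiB of swap (524288 pages) and a ratio of 50, the limit is 524288 + 524288 = 1048576 pages, that is 4 GiB.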

    213         /*
    214          * Reserve some for root
    215          */
    216         if (!cap_sys_admin)
    217                 allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
    

    This keeps some amount of memory available to root only. (PAGE_SHIFT is 12 on systems with the usual 4 KiB pages, so the shift by PAGE_SHIFT - 10 is just a conversion from kilobytes to a page count.)
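The conversion can be checked with a few lines of C. This is only a sketch: kbytes_to_pages is a hypothetical name, and the value 8192 below is just an example input, not a claim about any system's setting.

```c
/* Hypothetical helper: convert kilobytes to a page count.
 * A page holds 2^page_shift bytes, i.e. 2^(page_shift - 10) kilobytes,
 * so dividing by that amount is a right shift by (page_shift - 10). */
unsigned long kbytes_to_pages(unsigned long kbytes, unsigned int page_shift)
{
    return kbytes >> (page_shift - 10);
}
```

With 4 KiB pages (page_shift of 12), 8192 kilobytes become 8192 >> 2 = 2048 pages. With 64 KiB pages (page_shift of 16), the same amount is 8192 >> 6 = 128 pages.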

    218 
    219         /*
    220          * Don't let a single process grow so big a user can't recover
    221          */
    222         if (mm) {
    223                 reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
    224                 allowed -= min_t(long, mm->total_vm / 32, reserve);
    225         }
    226 
    227         if (percpu_counter_read_positive(&vm_committed_as) < allowed)
    228                 return 0;
    

    If, after accounting for the request, the total amount of memory committed by userspace is still less than allowed, the request succeeds. Otherwise the request is denied (and its accounting is undone):

    229 error:
    230         vm_unacct_memory(pages);
    231 
    232         return -ENOMEM;
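
Putting the mode 2 steps together, the decision can be sketched as a pure function. The name strict_enough_memory and the parameter list are assumptions for illustration: committed_as is taken to already include the new request, both reserves are given in pages, and the process is assumed to have an mm (the kernel skips the per-process reserve when mm is NULL).

```c
#include <errno.h>

static long min_long(long a, long b)
{
    return a < b ? a : b;
}

/* Hypothetical user-space copy of the OVERCOMMIT_NEVER check.
 * Returns 0 if the request may proceed, -ENOMEM otherwise. */
int strict_enough_memory(long committed_as, long allowed,
                         long admin_reserve_pages, long user_reserve_pages,
                         long mm_total_vm, int cap_sys_admin)
{
    /* Reserve some for root. */
    if (!cap_sys_admin)
        allowed -= admin_reserve_pages;

    /* Don't let a single process grow so big a user can't recover. */
    allowed -= min_long(mm_total_vm / 32, user_reserve_pages);

    return committed_as < allowed ? 0 : -ENOMEM;
}
```

In the kernel the failure path additionally calls vm_unacct_memory(pages) to undo the accounting, which this sketch has no counter to mirror.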
    

    In other words, as summarized in "The Linux kernel. Some remarks on the Linux Kernel", 2003-02-01 by Andries Brouwer, 9. Memory, 9.6 Overcommit and OOM - https://www.win.tue.nl/~aeb/linux/lk/lk-9.html:

    Going in the right direction

    Since 2.5.30 the values are:

    • 0 (default): as before: guess about how much overcommitment is reasonable,
    • 1: never refuse any malloc(),
    • 2: be precise about the overcommit - never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory.

    So "2" is a precise calculation of the amount of memory committed after the request, and "0" is a heuristic estimate.
