Linux Huge Pages: THP Splitting


This article analyzes only the THP splitting process; why and where a split is needed is left for another time.

Overview of the split functions

  • split_huge_page_to_list
  • split_huge_page
  • __split_huge_page_pmd
  • split_huge_page_pmd
  • split_huge_page_pmd_mm
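
These fall into two families: splitting a given huge page (split_huge_page_to_list() and its wrapper split_huge_page()), and splitting the huge-pmd mapping at a given address (__split_huge_page_pmd() and its wrappers). For orientation, the wrappers in 3.x-era kernels look roughly like the sketch below (based on include/linux/huge_mm.h and mm/huge_memory.c; details vary between versions):

static inline int split_huge_page(struct page *page)
{
        /* NULL list: tail pages go back onto the LRU lists */
        return split_huge_page_to_list(page, NULL);
}

void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
                            pmd_t *pmd)
{
        struct vm_area_struct *vma;

        /* resolve the VMA, then defer to the per-VMA variant */
        vma = find_vma(mm, address);
        BUG_ON(vma == NULL);
        split_huge_page_pmd(vma, address, pmd);
}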


split_huge_page_to_list

mm/huge_memory.c

/*
 * Split a hugepage into normal pages. This doesn't change the position of head
 * page. If @list is null, tail pages will be added to LRU list, otherwise, to
 * @list. Both head page and tail pages will inherit mapping, flags, and so on
 * from the hugepage.
 * Return 0 if the hugepage is split successfully otherwise return 1.
 */
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
        struct anon_vma *anon_vma;
        int ret = 1;

        BUG_ON(is_huge_zero_page(page));   // the global zero huge page cannot be split
        BUG_ON(!PageAnon(page));

        /*
         * The caller does not necessarily hold an mmap_sem that would prevent
         * the anon_vma disappearing so we first we take a reference to it
         * and then lock the anon_vma for write. This is similar to
         * page_lock_anon_vma_read except the write lock is taken to serialise
         * against parallel split or collapse operations.
         */
        anon_vma = page_get_anon_vma(page);
        if (!anon_vma)
                goto out;
        anon_vma_lock_write(anon_vma);  // take the write lock on the anon_vma

        ret = 0;
        if (!PageCompound(page))  // a THP must be a compound page
                goto out_unlock;

        BUG_ON(!PageSwapBacked(page));
        __split_huge_page(page, anon_vma, list);
        count_vm_event(THP_SPLIT);

        BUG_ON(PageCompound(page));
out_unlock:
        anon_vma_unlock_write(anon_vma);
        put_anon_vma(anon_vma);
out:
        return ret;
}


  • The page to be split must not be the global zero huge page.
  • The mapping field of the THP's head page must have the PAGE_MAPPING_ANON flag set (PageAnon), i.e. the head page descriptor's mapping field must point to an anon_vma object.
  • Because the caller does not necessarily hold mmap_sem, the anon_vma associated with this THP could disappear while we operate on it, so we must take a reference on it (page_get_anon_vma) and acquire the write lock on the root of the anon_vma tree (anon_vma_lock_write). Once this succeeds, unless a bug is hit during the split that follows, the function is guaranteed to return 0, i.e. success.
  • The THP must of course be a compound page.
  • The head page must have the PG_swapbacked flag set.
  • Call __split_huge_page() to perform the actual split (a sketch of a typical caller follows this list).
  • Bump the THP_SPLIT counter.
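
For context: the typical user of the @list argument is page reclaim, which splits a huge page onto its private page list before swapping the pieces out. A hedged sketch, modeled on add_to_swap() in mm/swap_state.c of the same era (the surrounding error handling is abbreviated and should be treated as an assumption):

        /* in add_to_swap(): a huge page cannot be swapped out as one unit,
         * so split it onto the caller's page list first */
        if (unlikely(PageTransHuge(page)))
                if (unlikely(split_huge_page_to_list(page, list)))
                        return 0;       /* split failed; give up on this page */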

The __split_huge_page() function is as follows:

mm/huge_memory.c

/* must be called with anon_vma->root->rwsem held */
static void __split_huge_page(struct page *page,
                              struct anon_vma *anon_vma,
                              struct list_head *list)
{
        int mapcount, mapcount2;
        pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
        struct anon_vma_chain *avc;

        BUG_ON(!PageHead(page));
        BUG_ON(PageTail(page));

        mapcount = 0;
        anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long addr = vma_address(page, vma);
                BUG_ON(is_vma_temporary_stack(vma));
                mapcount += __split_huge_page_splitting(page, vma, addr);
        }
        /*
         * It is critical that new vmas are added to the tail of the
         * anon_vma list. This guarantes that if copy_huge_pmd() runs
         * and establishes a child pmd before
         * __split_huge_page_splitting() freezes the parent pmd (so if
         * we fail to prevent copy_huge_pmd() from running until the
         * whole __split_huge_page() is complete), we will still see
         * the newly established pmd of the child later during the
         * walk, to be able to set it as pmd_trans_splitting too.
         */
        if (mapcount != page_mapcount(page)) {
                pr_err("mapcount %d page_mapcount %d\n",
                        mapcount, page_mapcount(page));
                BUG();
        }

        __split_huge_page_refcount(page, list);

        mapcount2 = 0;
        anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long addr = vma_address(page, vma);
                BUG_ON(is_vma_temporary_stack(vma));
                mapcount2 += __split_huge_page_map(page, vma, addr);
        }
        if (mapcount != mapcount2) {
                pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
                        mapcount, mapcount2, page_mapcount(page));
                BUG();
        }
}

As the comments in the code suggest, the function breaks down into three main steps:
1. Set the splitting flag in every pmd entry that maps the page (__split_huge_page_splitting()), announcing that the huge page is being split.
2. Split the huge page into normal 4K pages (__split_huge_page_refcount()).
3. Establish page-table mappings for the resulting 4K pages (__split_huge_page_map()).
Note the two mapcount checks: they catch the case where some other path establishes a new mapping of the huge page while the split function is running, which would of course be a bug.
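
Both interval-tree walks visit every VMA that maps the page at offset pgoff and derive the virtual address of the mapping with vma_address(). A minimal sketch of that computation, modeled on __vma_address() in mm/internal.h of this era (treat the exact form as an assumption):

static inline unsigned long
__vma_address(struct page *page, struct vm_area_struct *vma)
{
        pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

        /* translate the page's file offset into this VMA's address range */
        return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}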

Each step is described in detail below.


__split_huge_page_splitting

mm/huge_memory.c

static int __split_huge_page_splitting(struct page *page,
                                       struct vm_area_struct *vma,
                                       unsigned long address)
{
        struct mm_struct *mm = vma->vm_mm;
        spinlock_t *ptl;
        pmd_t *pmd;
        int ret = 0;
        /* For mmu_notifiers */
        const unsigned long mmun_start = address;
        const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;

        mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
        // if the pmd entry for this address in the mm exists and maps a huge page, return it
        pmd = page_check_address_pmd(page, mm, address,
                        PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
        if (pmd) {
                /*
                 * We can't temporarily set the pmd to null in order
                 * to split it, the pmd must remain marked huge at all
                 * times or the VM won't take the pmd_trans_huge paths
                 * and it won't wait on the anon_vma->root->rwsem to
                 * serialize against split_huge_page*.
                 */
                pmdp_splitting_flush(vma, address, pmd);

                ret = 1;
                spin_unlock(ptl);
        }
        mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

        return ret;
}
This function is straightforward: if the pmd entry found by walking the page tables for this address in the mm is a huge-page mapping (page_check_address_pmd()), set the _PAGE_BIT_SPLITTING flag in it, telling the rest of the VM that the huge page mapped by this pmd is being split. The comment in the code explains why we set _PAGE_BIT_SPLITTING rather than simply clearing the pmd entry: the entry must remain marked as a huge-page entry at all times, otherwise other parts of the VM would not take the pmd_trans_huge paths and therefore would not wait on the anon_vma->root->rwsem that serializes against the split.
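
On x86, kernels of this era implement pmdp_splitting_flush() roughly as below (sketch based on arch/x86/mm/pgtable.c; the TLB flush exists only to serialize against gup-fast, which walks page tables without taking any locks):

void pmdp_splitting_flush(struct vm_area_struct *vma,
                          unsigned long address, pmd_t *pmdp)
{
        int set;
        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
        /* atomically mark the pmd as splitting */
        set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
                                (unsigned long *)pmdp);
        if (set) {
                pmd_update(vma->vm_mm, address, pmdp);
                /* need tlb flush only to serialize against gup-fast */
                flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
        }
}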

__split_huge_page_refcount

The __split_huge_page_refcount() function is as follows:

mm/huge_memory.c

static void __split_huge_page_refcount(struct page *page,
                                       struct list_head *list)
{
        int i;
        struct zone *zone = page_zone(page);
        struct lruvec *lruvec;
        int tail_count = 0;

        /* prevent PageLRU to go away from under us, and freeze lru stats */
        // taking this lock simplifies handling of the resulting 4K pages below: they can be put straight onto the LRU lists
        spin_lock_irq(&zone->lru_lock);
        lruvec = mem_cgroup_page_lruvec(page, zone);

        compound_lock(page);
        /* complete memcg works before add pages to LRU */
        mem_cgroup_split_huge_fixup(page);

        // process all tail pages
        for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
                struct page *page_tail = page + i;

                /* tail_page->_mapcount cannot change */
                BUG_ON(page_mapcount(page_tail) < 0);
                tail_count += page_mapcount(page_tail);  // a tail page's pin count lives in its _mapcount
                /* check for overflow */
                BUG_ON(tail_count < 0);
                BUG_ON(atomic_read(&page_tail->_count) != 0);  // a tail page's _count must be 0
                /*
                 * tail_page->_count is zero and not changing from
                 * under us. But get_page_unless_zero() may be running
                 * from under us on the tail_page. If we used
                 * atomic_set() below instead of atomic_add(), we
                 * would then run atomic_set() concurrently with
                 * get_page_unless_zero(), and atomic_set() is
                 * implemented in C not using locked ops. spin_unlock
                 * on x86 sometime uses locked ops because of PPro
                 * errata 66, 92, so unless somebody can guarantee
                 * atomic_set() here would be safe on all archs (and
                 * not only on x86), it's safer to use atomic_add().
                 */
                // after the split, a tail page's refcount is its process map
                // count + the kernel's pin count on it + 1; the extra 1 pins
                // the tail while we work on it and is dropped again below
                atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
                           &page_tail->_count);

                /* after clearing PageTail the gup refcount can be released */
                smp_mb();

                // below: propagate flags from the head page to the tail (per-tail flags filtered out) and mark the tail dirty
                /*
                 * retain hwpoison flag of the poisoned tail page:
                 *   fix for the unsuitable process killed on Guest Machine(KVM)
                 *   by the memory-failure.
                 */
                page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
                page_tail->flags |= (page->flags &
                                     ((1L << PG_referenced) |
                                      (1L << PG_swapbacked) |
                                      (1L << PG_mlocked) |
                                      (1L << PG_uptodate) |
                                      (1L << PG_active) |
                                      (1L << PG_unevictable)));
                page_tail->flags |= (1L << PG_dirty);

                /* clear PageTail before overwriting first_page */
                smp_wmb();

                /*
                 * __split_huge_page_splitting() already set the
                 * splitting bit in all pmd that could map this
                 * hugepage, that will ensure no CPU can alter the
                 * mapcount on the head page. The mapcount is only
                 * accounted in the head page and it has to be
                 * transferred to all tail pages in the below code. So
                 * for this code to be safe, the split the mapcount
                 * can't change. But that doesn't mean userland can't
                 * keep changing and reading the page contents while
                 * we transfer the mapcount, so the pmd splitting
                 * status is achieved setting a reserved bit in the
                 * pmd, not by clearing the present bit.
                */
                // this comment again explains why the pmd is marked with a splitting bit instead of being cleared
                page_tail->_mapcount = page->_mapcount;

                BUG_ON(page_tail->mapping);
                page_tail->mapping = page->mapping;

                page_tail->index = page->index + i;
                page_cpupid_xchg_last(page_tail, page_cpupid_last(page));

                BUG_ON(!PageAnon(page_tail));
                BUG_ON(!PageUptodate(page_tail));
                BUG_ON(!PageDirty(page_tail));
                BUG_ON(!PageSwapBacked(page_tail));

                // put the split-off tail page onto an LRU list
                lru_add_page_tail(page, page_tail, lruvec, list);
        }
        // pins on tail pages were accumulated in the head page's _count, so the split must subtract those accumulated tail pins from the head
        atomic_sub(tail_count, &page->_count);
        BUG_ON(atomic_read(&page->_count) <= 0);

        __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);

        ClearPageCompound(page);  // clear the compound-page marker
        compound_unlock(page);
        spin_unlock_irq(&zone->lru_lock);

        // drop the extra reference taken above to pin each tail page
        for (i = 1; i < HPAGE_PMD_NR; i++) {
                struct page *page_tail = page + i;
                BUG_ON(page_count(page_tail) <= 0);
                /*
                 * Tail pages may be freed if there wasn't any mapping
                 * like if add_to_swap() is running on a lru page that
                 * had its mapping zapped. And freeing these pages
                 * requires taking the lru_lock so we do the put_page
                 * of the tail pages after the split is complete.
                 */
                put_page(page_tail);
        }

        /*
         * Only the head page (now become a regular page) is required
         * to be pinned by the caller.
         */
        BUG_ON(page_count(page) <= 0);  // the caller holds a pin on the head page, so its refcount cannot be 0
}

The core of this function is the handling of the refcounts of the pages after the split, of their flags, and of the pages themselves.

  • A tail page's refcount comes to hold the process map count (previously kept in the head page's _mapcount) plus the kernel's pin count on the tail itself (kept in the tail's own _mapcount).
  • The head page's refcount must shed all the tail-page pin counts that were accumulated on it.
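
A hypothetical worked example (the numbers are illustrative, not taken from the source): suppose the THP is mapped by two processes and one tail page carries a single gup pin.

/*
 * Before the split:
 *   page_mapcount(head) == 2    two process mappings (head->_mapcount)
 *   page_mapcount(tail) == 1    one gup pin (kept in the tail's own _mapcount)
 *   head->_count        == 4    2 mappings + the caller's pin + the tail's
 *                               gup pin, which is also accumulated on the head
 *   tail->_count        == 0
 *
 * During the split, for that tail page:
 *   atomic_add(2 + 1 + 1, &tail->_count)   =>  tail->_count == 4
 *   the final put_page(page_tail) drops the temporary pin => tail->_count == 3
 *   (2 mappings + 1 gup pin, now accounted on the tail itself)
 *
 * For the head page (tail_count == 1 here):
 *   atomic_sub(tail_count, &head->_count)  =>  head->_count == 3
 *   (2 mappings + the caller's pin; the gup pin has migrated to the tail)
 */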

After the split, every page needs somewhere to go:

mm/swap.c

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* used by __split_huge_page_refcount() */
void lru_add_page_tail(struct page *page, struct page *page_tail,
                       struct lruvec *lruvec, struct list_head *list)
{
        const int file = 0;

        VM_BUG_ON_PAGE(!PageHead(page), page);
        VM_BUG_ON_PAGE(PageCompound(page_tail), page);
        VM_BUG_ON_PAGE(PageLRU(page_tail), page);
        VM_BUG_ON(NR_CPUS != 1 &&
                  !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));

        // no list was supplied, so the page goes onto an LRU list
        if (!list)
                SetPageLRU(page_tail);

        // if the head page is on an LRU list, add the tail page to the tail of that same list
        if (likely(PageLRU(page)))
                list_add_tail(&page_tail->lru, &page->lru);
        else if (list) {
                /* page reclaim is reclaiming a huge page */
                get_page(page_tail);
                list_add_tail(&page_tail->lru, list);
        } else {
                struct list_head *list_head;
                /*
                 * Head page has not yet been counted, as an hpage,
                 * so we must account for each subpage individually.
                 *
                 * Use the standard add function to put page_tail on the list,
                 * but then correct its position so they all end up in order.
                 */
                add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
                list_head = page_tail->lru.prev;
                list_move_tail(&page_tail->lru, list_head);
        }

        if (!PageUnevictable(page))
                update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */



If a list is supplied, page reclaim is splitting the huge page in order to reclaim it; each tail page is then pinned (get_page) before being added to the list.
Otherwise the page goes onto an LRU list (the function is called with lru_lock held, so the LRU lists can be manipulated directly):
  • preferably, the tail page is placed right behind the head page on the head's LRU list;
  • otherwise it is placed on the LRU list chosen by its own flags, with a final adjustment of its position within that list (page_lru() is sketched below).
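
For reference, page_lru() selects the list from the page's flags; a sketch modeled on include/linux/mm_inline.h of this era:

static __always_inline enum lru_list page_lru(struct page *page)
{
        enum lru_list lru;

        if (PageUnevictable(page))
                lru = LRU_UNEVICTABLE;
        else {
                /* anon vs file base list, promoted if the page is active */
                lru = page_lru_base_type(page);
                if (PageActive(page))
                        lru += LRU_ACTIVE;
        }
        return lru;
}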

__split_huge_page_map

The __split_huge_page_map() function is as follows:

mm/huge_memory.c

static int __split_huge_page_map(struct page *page,
                                 struct vm_area_struct *vma,
                                 unsigned long address)
{
        struct mm_struct *mm = vma->vm_mm;
        spinlock_t *ptl;
        pmd_t *pmd, _pmd;
        int ret = 0, i;
        pgtable_t pgtable;
        unsigned long haddr;

        // look up the pmd; the flag demands a pmd that is currently being split, anything else is a bug
        pmd = page_check_address_pmd(page, mm, address,
                        PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
        if (pmd) {
                // fetch a page to hold the page table
                pgtable = pgtable_trans_huge_withdraw(mm, pmd);
                // point _pmd at that page table
                pmd_populate(mm, &_pmd, pgtable);

                haddr = address;
                for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
                        pte_t *pte, entry;
                        BUG_ON(PageCompound(page+i));
                        entry = mk_pte(page + i, vma->vm_page_prot);
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                        // the pmd is either write-protected, or exactly one process maps the huge page
                        if (!pmd_write(*pmd))
                                entry = pte_wrprotect(entry);
                        else
                                BUG_ON(page_mapcount(page) != 1);
                        if (!pmd_young(*pmd))
                                entry = pte_mkold(entry);
                        if (pmd_numa(*pmd))
                                entry = pte_mknuma(entry);
                        pte = pte_offset_map(&_pmd, haddr);  // get the corresponding pte from the page table
                        BUG_ON(!pte_none(*pte));
                        set_pte_at(mm, haddr, pte, entry);  // and the dust settles
                        pte_unmap(pte);
                }

                smp_wmb(); /* make pte visible before pmd */
                /*
                 * Up to this point the pmd is present and huge and
                 * userland has the whole access to the hugepage
                 * during the split (which happens in place). If we
                 * overwrite the pmd with the not-huge version
                 * pointing to the pte here (which of course we could
                 * if all CPUs were bug free), userland could trigger
                 * a small page size TLB miss on the small sized TLB
                 * while the hugepage TLB entry is still established
                 * in the huge TLB. Some CPU doesn't like that. See
                 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
                 * Erratum 383 on page 93. Intel should be safe but is
                 * also warns that it's only safe if the permission
                 * and cache attributes of the two entries loaded in
                 * the two TLB is identical (which should be the case
                 * here). But it is generally safer to never allow
                 * small and huge TLB entries for the same virtual
                 * address to be loaded simultaneously. So instead of
                 * doing "pmd_populate(); flush_tlb_range();" we first
                 * mark the current pmd notpresent (atomically because
                 * here the pmd_trans_huge and pmd_trans_splitting
                 * must remain set at all times on the pmd until the
                 * split is complete for this pmd), then we flush the
                 * SMP TLB and finally we write the non-huge version
                 * of the pmd entry with pmd_populate.
                 */
                pmdp_invalidate(vma, address, pmd);
                pmd_populate(mm, pmd, pgtable);  // update the pmd entry
                ret = 1;
                spin_unlock(ptl);
        }

        return ret;
}



This function remaps the pmd entry so that it points to a page table, each pte of which maps one of the ordinary 4K pages produced by the split. Note how the flag bits of the original pmd (write protection, accessed, NUMA) are carried over into each pte.
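
One detail worth spelling out: pgtable_trans_huge_withdraw() allocates nothing here. It returns a page table that was pre-deposited when the huge pmd was first established, which is why the split can never fail for lack of memory. A sketch of the fault-time counterpart, modeled on __do_huge_pmd_anonymous_page() in 3.x kernels (the surrounding code is elided and should be treated as an assumption):

        /* at THP fault time: pre-allocate the page table a future split will need */
        pgtable = pte_alloc_one(mm, haddr);
        if (unlikely(!pgtable))
                return VM_FAULT_OOM;

        /* ... install the huge pmd ... */

        /* stash the page table; __split_huge_page_map() withdraws it later */
        pgtable_trans_huge_deposit(mm, pmd, pgtable);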


To be continued...


