Using move_pages() to move hugepages?

守給你的承諾、 提交于 2020-01-23 17:19:28

问题


This question is for:

  1. kernel 3.10.0-1062.4.3.el7.x86_64
  2. non transparent hugepages allocated via boot parameters and might or might not be mapped to a file (e.g. mounted hugepages)
  3. x86_64

According to this kernel source, move_pages() will call do_pages_move() to move a page, but I don't see how it indirectly calls migrate_huge_page().

So my questions are:

  1. can move_pages() move hugepages? if yes, should the page boundary be 4KB or 2MB when passing an array of addresses of pages? It seems like there was a patch for supporting moving hugepages 5 years ago.
  2. if move_pages() cannot move hugepages, how can I move hugepages?
  3. after moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages like this answer?

According to the code below, it seems like I move hugepages via move_pages() with page size = 2MB but is it the correct way?:

#include <cstdint>
#include <iostream>
#include <numaif.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <limits>

int main(int argc, char** argv) {
        const int32_t dst_node = strtoul(argv[1], nullptr, 10);
        const constexpr uint64_t size = 4lu * 1024 * 1024;
        const constexpr uint64_t pageSize = 2lu * 1024 * 1024;
        const constexpr uint32_t nPages = size / pageSize;
        int32_t status[nPages];
        std::fill_n(status, nPages, std::numeric_limits<int32_t>::min());;
        void* pages[nPages];
        int32_t dst_nodes[nPages];
        void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);

        if (ptr == MAP_FAILED) {
                throw "failed to map hugepages";
        }
        memset(ptr, 0x41, nPages*pageSize);
        for (uint32_t i = 0; i < nPages; i++) {
                pages[i] = &((char*)ptr)[i*pageSize];
                dst_nodes[i] = dst_node;
        }

        std::cout << "Before moving" << std::endl;

        if (0 != move_pages(0, nPages, pages, nullptr, status, 0)) {
            std::cout << "failed to inquiry pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nPages; i++) {
                        std::cout << "page # " << i << " locates at numa node " << status[i] << std::endl;
                }
        }

        // real move
        if (0 != move_pages(0, nPages, pages, dst_nodes, status, MPOL_MF_MOVE_ALL)) {
                std::cout << "failed to move pages because " << strerror(errno) << std::endl;
                exit(-1);
        }

        const constexpr uint64_t smallPageSize = 4lu * 1024;
        const constexpr uint32_t nSmallPages = size / smallPageSize;
        void* smallPages[nSmallPages];
        int32_t smallStatus[nSmallPages] = {std::numeric_limits<int32_t>::min()};
        for (uint32_t i = 0; i < nSmallPages; i++) {
                smallPages[i] = &((char*)ptr)[i*smallPageSize];
        }


        std::cout << "after moving" << std::endl;
        if (0 != move_pages(0, nSmallPages, smallPages, nullptr, smallStatus, 0)) {
            std::cout << "failed to inquiry pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nSmallPages; i++) {
                        std::cout << "page # " << i << " locates at numa node " << smallStatus[i] << std::endl;
                }
        }

}

And should I query the NUMA IDs based on 4KB page size (like the code above)? Or 2MB?


回答1:


For original version of 3.10 linux kernel (not redhat patched, as I have no LXR for rhel kernels) syscall move_pages will force splitting huge page (2MB; both THP and hugetlbfs styles) into small pages (4KB). move_pages uses too short chunks (around 0.5MB if I calculated correctly) and the function graph is like:

move_pages .. -> migrate_pages -> unmap_and_move ->

static int unmap_and_move(new_page_t get_new_page, unsigned long private,
            struct page *page, int force, enum migrate_mode mode)
{
    struct page *newpage = get_new_page(page, private, &result);
    ....
    if (unlikely(PageTransHuge(page)))
        if (unlikely(split_huge_page(page)))
            goto out;

PageTransHuge returns true for both kinds of hugepages (thp and libhugetlbs): https://elixir.bootlin.com/linux/v3.10/source/include/linux/page-flags.h#L411

PageTransHuge() returns true for both transparent huge and hugetlbfs pages, but not normal pages.

And split_huge_page will call split_huge_page_to_list which:

Split a hugepage into normal pages. This doesn't change the position of head page.

Split will also emit vm_event counter increment of kind THP_SPLIT. The counters are exported in /proc/vmstat ("file displays various virtual memory statistics"). You can check this counter with this UUOC command cat /proc/vmstat |grep thp_split before and after your test.

There were some code for hugepage migration in 3.10 version as unmap_and_move_huge_page function which is not called from move_pages. The only usage of it in 3.10 was in migrate_huge_page which is called only from memory failure handler soft_offline_huge_page (__soft_offline_page) (added 2010):

Soft offline a page, by migration or invalidation, without killing anything. This is for the case when a page is not corrupted yet (so it's still valid to access), but has had a number of corrected errors and is better taken out.

Answers:

can move_pages() move hugepages? if yes, should the page boundary be 4KB or 2MB when passing an array of addresses of pages? It seems like there was a patch for supporting moving hugepages 5 years ago.

Standard 3.10 kernel have move_pages which will accept array "pages" of 4KB page pointers and it will break (split) huge page into 512 small pages and then it will migrate small pages. There are very low chances for them to be merged back by thp as move_pages does separate requests for physical memory pages and they almost always will be non-continuous.

Don't give pointers to "2MB", it will still split every huge page mentioned and migrate only first 4KB small page of this memory.

2013 patch was not added into original 3.10 kernel.

  • v2 https://lwn.net/Articles/544044/ "extend hugepage migration" (3.9);
  • v3 https://lwn.net/Articles/559575/ (3.11)
  • v4 https://lore.kernel.org/patchwork/cover/395020/ (click on Related to get access to individual patches, for example move_pages patch)

The patch seems to be accepted in September 2013: https://github.com/torvalds/linux/search?q=+extend+hugepage+migration&type=Commits

if move_pages() cannot move hugepages, how can I move hugepages?

move_pages will move data from hugepages as small pages. You can: allocate huge page in manual mode at correct numa node and copy your data (copy twice if you want to keep virtual address); or update kernel to some version with the patch and use methods and tests of patch author, Naoya Horiguchi (JP). There is copy of his tests: https://github.com/srikanth007m/test_hugepage_migration_extension (https://github.com/Naoya-Horiguchi/test_core is required)

https://github.com/srikanth007m/test_hugepage_migration_extension/blob/master/test_move_pages.c

Now I'm not sure how to start the test and how to check that it works correctly. For ./test_move_pages -v -m private -h 2048 runs with recent kernel it does not increment THP_SPLIT counter.

His test looks very similar to our tests: mmap, memset to fault pages, filling pages array with pointers to small pages, numa_move_pages

after moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages like this answer?

You can query status of any memory by providing correct array "pages" to move_pages syscall in query mode (with null nodes). Array should list every small page of the memory region you want to check.

If you know any reliable method to check if the memory mapped to huge page or not, you can query any small page of huge page. I think that there can be probabilistic method if you can export physical address out of kernel to the user-space (using some LKM module for example): for huge page virtual and physical addresses will always have 21 common LSB bits, and for small pages bits will coincide only for 1 test in million. Or just write LKM to export PMD Directory.



来源:https://stackoverflow.com/questions/59726288/using-move-pages-to-move-hugepages

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!