问题

Looking into the internals of Linux and memory management, I just stumbled upon the segmented paging model that Linux uses.

Correct me if I am wrong, but Linux (protected mode) does use paging for mapping a linear virtual address space to the physical address space. This linear address space constituted of pages, is split into four segments for the process flat memory model, namely:

The kernel code segment (__KERNEL_CS);
The kernel data segment (__KERNEL_DS);
The user code segment (__USER_CS);
The user data segment (__USER_DS);

A fifth memory segment known as the Null segment is present but unused.

These segments have a CPL (Current Privilege Level) of either 0 (supervisor) or 3 (userland).

To keep it simple, I will concentrate of the 32-bit memory mapping, with a 4GiB adressable space, 3GiB being for the userland process space (shown in green), 1GiB being for the supervisor kernel space (shown in red):

~~So the red part consists of two segments __KERNEL_CS and __KERNEL_DS, and the green part of two segments __USER_CS and __USER_DS.~~

These segments overlap each others. Paging will be used for userland and kernel isolation.

However, as extracted from Wikipedia here:

[...] many 32-bit operating systems simulate a flat memory model by setting all segments' bases to 0 in order to make segmentation neutral to programs.

Looking into the linux kernel code for the GDT here:

[GDT_ENTRY_KERNEL32_CS]       = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS]         = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS]         = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS]   = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS]   = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),

As Peter pointed out, each segment begin at 0, but what are those flags, namely 0xc09b, 0xa09b and so on ? I tend to believe they are the segments selectors, if not, how would I be able to access the userland segment from the kernel segment, if both their addressing space start at 0 ?

Segmentation is not used. Only paging is used. Segments have their seg_base addresses set 0, extending their space to 0xFFFFF and thus giving a full linear address space. That means that logical addresses are not different from linear addresses.

~~Also, since all segments overlap each others, is it the paging unit which provides memory protection (i.e. the memory separation) ?~~

Paging provide protection, not segmentation. The kernel will check the linear address space, and, according to a boundary (often known as TASK_MAX), will check the privilege level for the requested page.

回答1:

Yes, Linux uses paging so all addresses are always virtual. (To access memory at a known physical address, Linux keeps all physical memory 1:1 mapped to a range of kernel virtual address space, so it can simply index into that "array" using the physical address as the offset. Modulo complications for 32-bit kernels on systems with more physical RAM than kernel address space.)

This linear address space constituted of pages, is split into four segments

No, Linux uses a flat memory model. The base and limit for all 4 of those segment descriptors are 0 and -1 (unlimited). i.e. they all fully overlap, covering the entire 32-bit virtual linear address space.

So the red part consists of two segments __KERNEL_CS and __KERNEL_DS

No, this is where you went wrong. x86 segment registers are not used for segmentation; they're x86 legacy baggage that's only used for CPU mode and privilege-level selection on x86-64. Instead of adding new mechanisms for that and dropping segments entirely for long mode, AMD just neutered segmentation in long mode (base fixed at 0 like everyone used in 32-bit mode anyway) and kept using segments only for machine-config purposes that are not particularly interesting unless you're actually writing code that switches to 32-bit mode or whatever.

(Except you can set a non-zero base for FS and/or GS, and Linux does so for thread-local storage. But this has nothing to do with how copy_from_user() is implemented, or anything. It only has to check that pointer value, not with reference to any segment or the CPL / RPL of a segment descriptor.)

In 32-bit legacy mode, it is possible to write a kernel that uses a segmented memory model, but none of the mainstream OSes actually did that. Some people wish that had become a thing, though, e.g. see this answer lamenting x86-64 making a Multics-style OS impossible. But this is not how Linux works.

Linux is a https://wiki.osdev.org/Higher_Half_Kernel, where kernel pointers have one range of values (the red part) and user-space addresses are in the green part. The kernel can simple dereference user-space addresses if the right user-space page-tables are mapped, it doesn't need to translate them or do anything with segments; this is what it means to have a flat memory model. (The kernel can use "user" page-table entries, but not vice versa). For x86-64 specifically, see https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for the actual memory map.

The only reason those 4 GDT entries all need to be separate is for privilege-level reasons, and that the data vs. code segments descriptors have different formats. (A GDT entry contains more than just the base/limit; those are the parts that need to be different. See https://wiki.osdev.org/Global_Descriptor_Table)

And especially https://wiki.osdev.org/Segmentation#Notes_Regarding_C which describes how and why the GDT is typically used by a "normal" OS to create a flat memory model, with a pair of code and data descriptors for each privilege level.

For a 32-bit Linux kernel, only gs gets a non-zero base for thread-local storage (so addressing modes like [gs: 0x10] will access a linear address that depends on the thread that executes it). Or in a 64-bit kernel (and 64-bit user-space), Linux uses fs. (Because x86-64 made GS special with the swapgs instruction, intended for use with syscall for the kernel to find the kernel stack.)

But anyway, the non-zero base for FS or GS are not from a GDT entry, they're set with the wrgsbase instruction. (Or on CPUs that don't support that, with a write to an MSR).

but what are those flags, namely 0xc09b, 0xa09b and so on ? I tend to believe they are the segments selectors

No, segment selectors are indices into the GDT. The kernel is defining the GDT as a C array, using designated-initializer syntax like [GDT_ENTRY_KERNEL32_CS] = initializer_for_that_selector.

(Actually the low 2 bits of a selector, i.e. segment register value, are the current privilege level. So GDT_ENTRY_DEFAULT_USER_CS should be `__USER_CS >> 2.)

mov ds, eax triggers the hardware to index the GDT, not linear search it for matching data in memory!

GDT data format:

You're looking at x86-64 Linux source code, so the kernel will be in long mode, not protected mode. We can tell because there are separate entries for USER_CS and USER32_CS. The 32-bit code segment descriptor will have its L bit cleared. The current CS segment description is what puts an x86-64 CPU into 32-bit compat mode vs. 64-bit long mode. To enter 32-bit user-space, an iret or sysret will set CS:RIP to a user-mode 32-bit segment selector.

I think you can also have the CPU in 16-bit compat mode (like compat mode not real mode, but the default operand-size and address size are 16). Linux doesn't do this, though.

Anyway, as explained in https://wiki.osdev.org/Global_Descriptor_Table and Segmentation,

Each segment descriptor contains the following information:

The base address of the segment

The default operation size in the segment (16-bit/32-bit)

The privilege level of the descriptor (Ring 0 -> Ring 3)

The granularity (Segment limit is in byte/4kb units)

The segment limit (The maximum legal offset within the segment)

The segment presence (Is it present or not)

The descriptor type (0 = system; 1 = code/data)

The segment type (Code/Data/Read/Write/Accessed/Conforming/Non-Conforming/Expand-Up/Expand-Down)

These are the extra bits. I'm not particularly interested in which bits are which because I (think I) understand the high level picture of what different GDT entries are for and what they do, without getting into the details of how that's actually encoded.

But if you check the x86 manuals or the osdev wiki, and the definitions for those init macros, you should find that they result in a GDT entry with the L bit set for 64-bit code segments, cleared for 32-bit code segments. And obviously the type (code vs. data) and privilege level differ.

回答2:

Disclaimer

I am posting this answer to clear this topic of any misconceptions (as pointed out by @PeterCordes).

Paging

The memory management in Linux (x86 protected mode) uses paging for mapping the physical addresses to a virtualized flat linear address space, from 0x00000000 to 0xFFFFFFFF (on 32-bit), known as the flat memory model. Linux, along with the CPU's MMU (Memory Management Unit), will maintain every virtual and logical address mapped 1:1 to the corresponding physical address. The physical memory is usually split into 4KiB pages, to allow an easier management of memory.

The kernel virtual addresses can be contiguous kernel logical addresses directly mapped into contiguous physical pages; other kernel virtual addresses are fully virtual addresses mapped in not-contiguous physical pages used for large buffer allocations (exceeding the contiguous area on small-memory systems) and/or PAE memory (32-bit only). MMIO ports (Memory-Mapped I/O) are also mapped using kernel virtual addresses.

Every dereferenced address must be a virtual address. Either it is a logical or a fully virtual address, physical RAM and MMIO ports are mapped in the virtual address space prior to use.

The kernel obtains a chunk of virtual memory using kmalloc(), pointed by a virtual address, but more importantly, that is also a kernel logical address, meaning it has direct mapping to contiguous physical pages (thus suitable for DMA). On the other hand, the vmalloc() routine will return a chunk of fully virtual memory, pointed by a virtual address, but only contiguous on the virtual address space and mapped to not-contiguous physical pages.

Kernel logical addresses use a fixed mapping between physical and virtual address space. This means virtually-contiguous regions are by nature also physically contiguous. This is not the case with fully virtual addresses, which point to not-contiguous physical pages.

The user virtual addresses - unlike kernel logical addresses - do not use a fixed mapping between virtual and physical addresses, userland processes make full use of the MMU:

Only used portions of physical memory are mapped;
Memory is not-contiguous;
Memory may be swapped out;
Memory can be moved;

In more details, physical memory pages of 4KiB are mapped to virtual addresses in the OS page table, each mapping known as a PTE (Page Table Entry). The CPU's MMU will then keep a cache of each recently used PTEs from the OS page table. This caching area, is known as the TLB (Translation Lookaside Buffer). The cr3 register is used to locate the OS page table.

Whenever a virtual address needs to be translated into a physical one, the TLB will be searched. If a match is found (TLB hit), the physical address is returned and accessed. However, if there is no match (TLB miss), the TLB miss handler will look up the page table to see whether a mapping exists (page walk). If one exists, it is written back to the TLB and the faulting instruction is restarted, this subsequent translation will then find a TLB hit and the memory access will continue. This is known as a minor page fault.

Sometimes, the OS may need to increase the size of physical RAM by moving pages into the hard disk. If a virtual address resolve to a page mapped in the hard disk, the page needs to be loaded in physical RAM prior to be accessed. This is known as a major page fault. The OS page fault handler will then need to find a free page in memory.

The translation process may fail if there is no mapping available for the virtual address, meaning that the virtual address is invalid. This is known as an invalid page fault exception, and a segfault will be issued to the process by the OS page fault handler.

Memory segmentation

Real mode

Real mode still uses a 20-bit segmented memory address space, with 1MiB of addressable memory (0x00000 - 0xFFFFF) and unlimited direct software access to all addressable memory, bus addresses, PMIO ports (Port-Mapped I/O) and peripheral hardware. Real mode provides no memory protection, no privilege levels and no virtualized addresses. Typically, a segment register contains the segment selector value, and the memory operand is an offset value relative to the segment base.

To work around segmentation (C compilers usually only support the flat memory model), C compilers used the unofficial far pointer type to represent a physical address with a segment:offset logical address notation. For instance, the logical address 0x5555:0x0005, after computing 0x5555 * 16 + 0x0005 yields the 20-bit physical address 0x55555, usable in a far pointer as shown below:

char far    *ptr;           /* declare a far pointer */
ptr = (char far *)0x55555;  /* initialize a far pointer */

As of today, most modern x86 CPUs still start in real mode for backwards compatibility and switch to protected mode thereafter.

Protected mode

In protected mode, with the flat memory model, segmentation is unused. The four segments, namely __KERNEL_CS, __KERNEL_DS, __USER_CS, __USER_DS all have their base addresses set to 0. These segments are just legacy baggage from the former x86 model where segmented memory management was used. In protected mode, since all segments base addresses are set to 0, logical addresses are equivalent to linear addresses.

Protected mode with the flat memory model means no segmentation. The only exception where a segment has its base address set to a value other than 0 is when thread-local storage is involved. The FS (and GS on 64-bit) segment registers are used for this purpose.

However, segment registers such as SS (stack segment register), DS (data segment register) or CS (code segment register) are still present and used to store 16-bit segment selectors, which contain indexes to segment descriptors in the LDT and GDT (Local & Global Descriptor Table).

Each instruction that touches memory implicitly uses a segment register. Depending on the context, a particular segment register is used. For instance, the JMP instruction uses CS while PUSH uses SS. Selectors can be loaded into registers with instructions like MOV, the sole exception being the CS register which is only modified by instructions affecting the flow of execution, like CALL or JMP.

The CS register is particularly useful because it keeps track in of the CPL (Current Privilege Level) in its segment selector, thus conserving the privilege level for the present segment. This 2-bit CPL value is always equivalent to the CPU current privilege level.

Memory protection

Paging

The CPU privilege level, also known as the mode bit or protection ring, from 0 to 3, restricts some instructions that can subvert the protection mechanism or cause chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring 0 causes a general-protection fault exception, same scenario when a invalid segment access error occurs (privilege, type, limit, read/write rights). Likewise, any access to memory and MMIO devices is restricted based on privilege level and every attempt to access a protected page without the required privilege level will cause a page fault exception.

The mode bit will be automatically switched from user mode to supervisor mode whenever an interrupt request (IRQ), either software (ie. syscall) or hardware, occurs.

On a 32-bit system, only 4GiB of memory can be effectively addressed, and the memory is split in a 3GiB/1GiB form. Linux (with paging enabled) uses a protection schema known as the higher half kernel where the flat addressing space is divided into two ranges of virtual addresses:

Addresses in the range 0xC0000000 - 0xFFFFFFFF are kernel virtual addresses (red area). The 896MiB range 0xC0000000 - 0xF7FFFFFF directly maps kernel logical addresses 1:1 with kernel physical addresses into the contiguous low-memory pages (using the __pa() and __va() macros). The remaining 128MiB range 0xF8000000 - 0xFFFFFFFF is then used to map virtual addresses for large buffer allocations, MMIO ports (Memory-Mapped I/O) and/or PAE memory into the not-contiguous high-memory pages (using ioremap() and iounmap()).
Addresses in the range 0x00000000 - 0xBFFFFFFF are user virtual addresses (green area), where userland code, data and libraries reside. The mapping can be in not-contiguous low-memory and high-memory pages.

High-memory is only present on 32-bit systems. All memory allocated with kmalloc() has a logical virtual address (with a direct physical mapping); memory allocated by vmalloc() has a fully virtual address (but no direct physical mapping). 64-bit systems have a huge addressing capability hence does not need high-memory, since every page of physical RAM can be effectively addressed.

The boundary address between the supervisor higher half and the userland lower half is known as TASK_SIZE_MAX in the Linux kernel. The kernel will check that every accessed virtual address from any userland process resides below that boundary, as seen in the code below:

static int fault_in_kernel_space(unsigned long address)
{
    /*
     * On 64-bit systems, the vsyscall page is at an address above
     * TASK_SIZE_MAX, but is not considered part of the kernel
     * address space.
     */
    if (IS_ENABLED(CONFIG_X86_64) && is_vsyscall_vaddr(address))
        return false;

    return address >= TASK_SIZE_MAX;
}

If an userland process tries to access a memory address higher than TASK_SIZE_MAX, the do_kern_addr_fault() routine will call the __bad_area_nosemaphore() routine, eventually signaling the faulting task with a SIGSEGV (using get_current() to get the task_struct):

/*
 * To avoid leaking information about the kernel page table
 * layout, pretend that user-mode accesses to kernel addresses
 * are always protection faults.
 */
if (address >= TASK_SIZE_MAX)
    error_code |= X86_PF_PROT;

force_sig_fault(SIGSEGV, si_code, (void __user *)address, tsk); /* Kill the process */

Pages also have a privilege bit, known as the User/Supervisor flag, used for SMAP (Supervisor Mode Access Prevention) in addition to the Read/Write flag that SMEP (Supervisor Mode Execution Prevention) uses.

Segmentation

Older architectures using segmentation usually perform segment access verification using the GDT privilege bit for each requested segment. The privilege bit of the requested segment, known as the DPL (Descriptor Privilege Level), is compared to the CPL of the current segment, ensuring that CPL <= DPL. If true, the memory access is then allowed to the requested segment.

来源：https://stackoverflow.com/questions/56213569/linux-memory-segmentation

标签

Linux