Why is thread local storage not implemented with page table mappings?


Question


I was hoping to use the C++11 thread_local keyword for a per-thread boolean flag that is going to be accessed very frequently.

However, most compilers seem to implement thread local storage with a table that maps integer IDs (slots) to the variable's address on the current thread. This lookup would happen inside a performance-critical code path, so I have some concerns about its performance.

The way I would have expected thread local storage to be implemented is by allocating virtual memory ranges that are backed by different physical pages depending on the thread. That way, accessing the flag would be the same cost as any other memory access, since the MMU takes care of the mapping.

Why do none of the mainstream compilers take advantage of page table mappings in this way?

I suppose I can implement my own "thread-specific page" with mmap on Linux and VirtualAlloc on Win32, but this seems like a pretty common use-case. If anyone knows of existing or better solutions, please point me to them.

I've also considered storing an std::atomic<std::thread::id> inside each object to represent the active thread, but profiling shows that the check for std::this_thread::get_id() == active_thread is quite expensive.
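For reference, a minimal sketch of the kind of check described above (the class name, layout, and memory ordering are illustrative assumptions, not from the original post):

#include <atomic>
#include <thread>

// Hypothetical object that tracks which thread currently "owns" it.
struct Guarded {
    std::atomic<std::thread::id> active_thread{std::thread::id{}};

    bool is_active_thread() const {
        // This comparison is the part profiling showed to be expensive:
        // std::this_thread::get_id() plus an atomic load on every call.
        return std::this_thread::get_id() ==
               active_thread.load(std::memory_order_relaxed);
    }
};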


Answer 1:


On Linux/x86-64, thread local storage is implemented through a special segment register, %fs (per the x86-64 ABI, page 23...).

So the following code (I'm using C with the GCC __thread extension, but it behaves the same as C++11 thread_local)

__thread int x;
int f(void) { return x; }

is compiled (with gcc -O -fverbose-asm -S) into:

         .text
 .Ltext0:
         .globl  f
         .type   f, @function
 f:
 .LFB0:
         .file 1 "tl.c"
         .loc 1 3 0
         .cfi_startproc
         .loc 1 3 0
         movl    %fs:x@tpoff, %eax       # x,
         ret
         .cfi_endproc
 .LFE0:
         .size   f, .-f
         .globl  x
         .section        .tbss,"awT",@nobits
         .align 4
         .type   x, @object
         .size   x, 4
 x:
         .zero   4

Therefore, contrary to your fears, access to TLS is really quick on Linux/x86-64. It is not implemented as a table lookup: the kernel and runtime manage the %fs segment register so that it points to a thread-specific memory zone, and the compiler and linker compute the offset within it. The older pthread_getspecific API did go through a table, but it is nearly useless once you have TLS.
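To illustrate the contrast the answer draws, here is a hedged sketch comparing the older pthread key API (which goes through a per-thread lookup at runtime) with a plain thread_local variable (which compiles down to a %fs-relative load, as shown above). The function names are illustrative, not from the original answer:

#include <pthread.h>

// Old style: a key is created once (here via pthread_once), and every
// access afterwards goes through pthread_getspecific(), i.e. an indexed
// lookup in per-thread data.
static pthread_key_t flag_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key() { pthread_key_create(&flag_key, nullptr); }

void set_flag_old_style(bool v) {
    pthread_once(&key_once, make_key);
    pthread_setspecific(flag_key, v ? (void*)1 : nullptr);
}

bool get_flag_old_style() {
    pthread_once(&key_once, make_key);
    return pthread_getspecific(flag_key) != nullptr;
}

// New style: the compiler and linker resolve the access to a fixed offset
// from the thread pointer (%fs on Linux/x86-64), so a read is an ordinary
// memory load.
thread_local bool flag = false;

bool get_flag_tls() { return flag; }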

BTW, by definition, all threads in the same process share the same virtual address space, since a process has a single address space of its own (see /proc/self/maps, and proc(5) for more about /proc/; see also mmap(2); the C++11 thread library is built on pthreads, which are implemented using clone(2)). So a "thread-specific memory mapping" is a contradiction in terms: once a task (the thing run by the kernel scheduler) has its own address space, it is called a process, not a thread. The defining characteristic of threads in the same process is that they share a common address space (and some other entities, like file descriptors).




Answer 2:


Mainstream operating systems like Linux, OS X, and Windows make page mapping a per-process property, not a per-thread one. There is a very good reason for that: the page mapping tables are stored in RAM, and reading them to compute the effective physical address would be excessively expensive if it had to be done for every instruction.

So the processor doesn't; it keeps a copy of the recently used mapping table entries in fast memory close to the execution core, called the TLB cache.

Invalidating the TLB cache is very expensive; it has to be reloaded from RAM, with low odds that the data is available in one of the memory caches. The processor can stall for thousands of cycles when this needs to happen.

So your proposed scheme is in fact likely to be very inefficient, assuming an operating system would support it at all; using an indexed lookup is cheaper. Processors are very good at simple math, which happens at gigahertz rates, while accessing memory happens at megahertz rates.




Answer 3:


The suggestion doesn't work, because it would prevent other threads from accessing your thread_local variables via a pointer. Those threads would end up accessing their own copy of that variable.

Say, for example, that you have a main thread and 100 worker threads. The worker threads pass pointers to their own thread_local variables back to the main thread, which now holds 100 pointers to those 100 variables. If the TLS memory were page-table mapped as suggested, the main thread would have 100 identical pointers to a single, uninitialized variable in the TLS of the main thread - certainly not what was intended!
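A hedged sketch of the scenario described above (names and thread count are illustrative): each worker publishes the address of its own thread_local variable, and the main thread can read every worker's value precisely because the instances live at distinct addresses in one shared address space.

#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

thread_local int per_thread_value = 0;

int main() {
    constexpr int kWorkers = 4;
    std::vector<int*> pointers;        // addresses published by the workers
    std::mutex m;
    std::atomic<int> published{0};
    std::atomic<bool> done{false};

    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i) {
        workers.emplace_back([&, i] {
            per_thread_value = i;       // each thread writes its own copy
            {
                std::lock_guard<std::mutex> lock(m);
                pointers.push_back(&per_thread_value);
            }
            published.fetch_add(1);
            // Stay alive until main has read through the pointers, because
            // a thread's TLS is destroyed when the thread exits.
            while (!done.load()) std::this_thread::yield();
        });
    }

    while (published.load() < kWorkers) std::this_thread::yield();

    // Every pointer refers to a different instance; this only works because
    // all threads share one address space. With per-thread page mappings,
    // main would see its own (uninitialized) copy behind each address.
    for (int* p : pointers) std::printf("%p -> %d\n", (void*)p, *p);

    done.store(true);
    for (auto& t : workers) t.join();
}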




Answer 4:


Memory mappings are not per-thread but per-process. All threads share the same mapping.

The kernel could offer per-thread mappings but it presently does not.




Answer 5:


You are using C++. Give each thread its own object, with the thread's work procedure and all or most of the functions it calls being member functions of that object. Then you can keep the thread ID, or any other thread-specific data, in member variables.
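A hedged sketch of the pattern this answer suggests (the class and member names are illustrative, not from the original answer): thread-specific state lives in ordinary data members, so the hot path never touches TLS at all.

#include <functional>
#include <thread>

// Per-thread worker object: thread-specific state is plain member data.
class Worker {
public:
    void operator()() {
        id_ = std::this_thread::get_id();
        for (int i = 0; i < 1000; ++i)
            step();
    }

private:
    void step() {
        // Hot path: 'flag_' is an ordinary member access, not a TLS lookup.
        flag_ = !flag_;
    }

    std::thread::id id_{};
    bool flag_ = false;
};

int main() {
    Worker w1, w2;                              // one object per thread
    std::thread t1(std::ref(w1)), t2(std::ref(w2));
    t1.join();
    t2.join();
}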




Answer 6:


One contemporary concern is hardware constraints (though I'm sure this predates the situations below).

On SPARC T5 processors, each hardware thread has its own MMU, but shares a TLB with up to seven sibling threads on the same core, and that TLB can get thrashed pretty hard.

On MIPS, different memory mappings for threads can force them to be serialized onto a single virtual thread execution context, because the hardware thread contexts share an MMU. The kernel already cannot run multiple processes on neighboring thread contexts, and separate memory mappings per thread would hit the same limitation.



Source: https://stackoverflow.com/questions/26437921/why-is-thread-local-storage-not-implemented-with-page-table-mappings
