Exclusive access to L1 cacheline on x86?

前端 未结 2 1564
[愿得一人]
[愿得一人] 2021-01-19 12:28

If one has a 64 byte buffer that is heavily read/written to then it\'s likely that it\'ll be kept in L1; but is there any way to force that behaviour?

As in, give o

2条回答
  •  孤独总比滥情好
    2021-01-19 13:17

    There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:

    • The line is shared, and therefore, it can be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared. That is, it will state in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
    • The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.

    There are other reasons why the line may get evicted from the L1 as I will discuss shortly.

    If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that may lead to faulty behavior, then the only thing you can do is to set the affinity of all threads that share the line to run on the same physical core on a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted due to other reasons.

    Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).

    Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. so if you play with the affinity and priority of the threads, performance can change in interesting ways. You'll have to measure it. Recent AMD processors also support SMT.

    If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.

    Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to all other cores, except the periodic timer interrupt whose interrupt handler's behavior is OS-dependent and it may not be safe to play with it. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. Also you should avoid exceptions.

    I can think of two reasons why a line might get invalidated:

    • A (potentially speculative) RFO with intent for modification from another core.
    • The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
      • The L1 cache placement policy.
      • The L1 cache replacement policy.
      • Whether lower level caches are inclusive or not.

    The replacement policy is commonly not configurable, so you should strive to avoid conflict L1 misses, which depends on the placement policy, which depends on the microarchitecture. On Intel processors, the L1D is typically both virtually indexed and physically indexed because the bits used for the index don't require translation. Since you know the virtual addresses of all memory accesses, you can determine which lines would be allocated from which cache set. You need to make sure that the number of lines mapped to the same set (including the line you don't want it to be evicted) does not exceed the associativity of the cache. Otherwise, you'd be at the mercy of the replacement policy. Note also that an L1D prefetcher can also change the contents of the cache. You can disable it on Intel processors and measure its impact in both cases. I cannot think of an easy way to deal with inclusive lower level caches.

    I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.

    The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.

提交回复
热议问题