cpu-cache

Is it necessary for the programmer to flush write-combining memory explicitly?

Question: I know that write-combining (WC) writes are buffered and don't reach memory directly. But is it necessary for the programmer to flush this memory explicitly before others can access it? I got this question from graphics driver code. For example, the CPU fills a vertex buffer (mapped as WC), but before the GPU accesses it, I don't see any flush operation in the code. Does the architecture (x86) already take care of this for us? Is there any more detailed documentation about this? Answer 1: According to the Intel® 64 and IA-32 Architectures Software Developer's Manual
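A minimal sketch of the pattern under discussion, assuming a WC-mapped buffer and SSE (the function and buffer names are hypothetical): WC stores accumulate in the CPU's write-combining buffers, and an explicit SFENCE drains them before another agent reads the memory. Drivers often get this draining for free from serializing or locked instructions on their command-submission path, which is why no visible flush may appear in the code.

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_sfence (SSE) */

    /* Hypothetical driver helper: fill a WC-mapped vertex buffer, then
       drain the write-combining buffers so the GPU sees the data. */
    void fill_vertex_buffer(volatile float *wc_buf, const float *verts, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            wc_buf[i] = verts[i];   /* stores accumulate in WC buffers */
        _mm_sfence();               /* flush WC buffers before GPU access */
    }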

How to abandon (invalidate without saving) a cache line on x86_64?

Question: As I understand it, _mm_clflush() / _mm_clflushopt() invalidates a cache line, writing it back to memory if it has been modified. Is there a way to simply abandon a cache line, without saving any changes made to it to memory? A use case is before freeing memory: I don't need the cache lines or their values anymore. Source: https://stackoverflow.com/questions/45987746/how-to-abandon-invalidate-without-saving-a-cache-line-on-x86-64
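For context, x86 offers no unprivileged discard-without-writeback: INVD is privileged and drops entire caches, not a single line. The closest user-mode option remains a write-back-and-invalidate per line; a minimal sketch, assuming 64-byte cache lines and CLFLUSHOPT support:

    #include <stddef.h>
    #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence */

    /* Write back and invalidate every line in [p, p+len); dirty data is
       still written to memory -- x86 has no user-mode way to avoid that. */
    void flush_range(const void *p, size_t len)
    {
        const char *c = (const char *)p;
        for (size_t off = 0; off < len; off += 64)   /* assumed line size */
            _mm_clflushopt(c + off);
        _mm_sfence();   /* order the flushes against later stores */
    }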

Machine code alignment

Question: I am trying to understand the principles of machine-code alignment. I have an assembler implementation which can generate machine code at run time. I use 16-byte alignment on every branch destination, but it looks like this is not the optimal choice, since I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with the cache-line width, so that some instructions are cut by a cache line and the CPU experiences stalls because of that. So if some bytes of
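A minimal sketch of the padding step such an emitter typically performs (the Emitter struct and its fields are hypothetical): before emitting a branch target, pad the code buffer with NOPs until the current emit address hits the chosen boundary. Real emitters usually prefer a few multi-byte NOPs over many single-byte ones, since a long run of 0x90s is itself extra work to decode, which is one reason unconditional alignment can hurt.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint8_t *buf; size_t len; } Emitter;  /* hypothetical */

    /* Pad with single-byte NOPs (0x90) so the next instruction emitted,
       e.g. a branch destination, starts on an `alignment`-byte boundary. */
    static void align_next_instruction(Emitter *e, size_t alignment)
    {
        while (((uintptr_t)(e->buf + e->len)) % alignment != 0)
            e->buf[e->len++] = 0x90;   /* x86 one-byte NOP */
    }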

How far should one trust hardware counter profiling using VsPerfCmd.exe?

Question: I'm attempting to use VsPerfCmd.exe to profile branch mispredictions and last-level cache misses in an instrumented native application. The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that always touches a 24 MB data set is reported to cause only ~700 cache misses when called ~2000 times. Now let me put this into perspective: the function linearly traverses two arrays of 1024*1024 12-byte elements. For every
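A quick back-of-envelope check of that number, assuming 64-byte cache lines and no reuse between calls (order-of-magnitude estimates only):

    #include <stdio.h>

    int main(void)
    {
        long bytes = 2L * 1024 * 1024 * 12;  /* two arrays -> 24 MiB      */
        long lines = bytes / 64;             /* 64-byte lines per pass    */
        long calls = 2000;
        printf("%ld bytes -> ~%ld misses/call, ~%ld over %ld calls\n",
               bytes, lines, lines * calls, calls);
        /* ~393,216 expected misses per call vs. ~700 reported in total:
           a gap of roughly six orders of magnitude, which points at the
           counter configuration or attribution rather than the code. */
        return 0;
    }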

How do I map a memory address to a block when there is an offset in a direct-mapped cache?

Question: To start off, the first cache has 16 one-word blocks. As an example I will use the memory reference 0x03. The index has 4 bits (0011). It is clear that this equals 3 mod 16 (0011 = 0x03 = 3). However, I get confused using this mod equation to determine the block location in a cache with offset bits. The second cache has a total size of eight two-word blocks. This means that there is 1 offset bit. Since there are now 8 blocks, there are only 3 index bits. As an example, I will take the same
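A minimal sketch of the decomposition for the second cache, using word addresses as the question does (1 word-offset bit, 3 index bits): the mod equation still works, but it applies to the block address, i.e. the reference with the offset bits shifted away.

    #include <stdio.h>

    int main(void)
    {
        unsigned addr   = 0x03;              /* word address 0b0011      */
        unsigned offset = addr & 0x1;        /* 1 bit: word within block */
        unsigned index  = (addr >> 1) & 0x7; /* 3 bits: (addr>>1) mod 8  */
        unsigned tag    = addr >> 4;         /* remaining bits           */
        printf("offset=%u index=%u tag=%u\n", offset, index, tag);
        /* 0x03 -> offset 1, index 1: block address 1, and 1 mod 8 = 1. */
        return 0;
    }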

Flush/invalidate a range by virtual address (ARMv8, cache)

Question: I'm implementing cache-maintenance functions for an ARMv8 (Cortex-A53) running in 32-bit mode. There is a problem when I try to flush a memory region by virtual address (VA). DCacheFlushByRange looks like this:

    // some init.
    // kDCacheL1 = 0; kDCacheL2 = 2;
    while (alignedVirtAddr < endAddr) {
        // Flushing L1
        asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r"(kDCacheL1) :); // select cache
        isb();
        asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r"(alignedVirtAddr) :); // clean & invalidate
        dsb()
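For comparison, a minimal sketch of a by-VA clean-and-invalidate loop (assumptions: 64-byte line size on the Cortex-A53, AArch32 encodings): DCCIMVAC (mcr p15, 0, Rt, c7, c14, 1) operates by virtual address to the point of coherency, which already covers both L1 and L2, so the CSSELR cache-select write in the question (which only affects what CCSIDR reads report) is not needed for by-VA maintenance.

    #include <stdint.h>

    /* Clean & invalidate [start, end) by VA to the point of coherency. */
    void dcache_clean_inval_range(uintptr_t start, uintptr_t end)
    {
        const uintptr_t line = 64;               /* assumed line size */
        uintptr_t addr = start & ~(line - 1);    /* align down        */
        for (; addr < end; addr += line)
            asm volatile("mcr p15, 0, %0, c7, c14, 1"   /* DCCIMVAC */
                         : : "r"(addr) : "memory");
        asm volatile("dsb" ::: "memory");  /* complete before returning */
    }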

Cache flush on CyclicBarrier or CountDownLatch, as with the synchronized keyword

Question: Is there a way to ensure that Java flushes the writes made before a CyclicBarrier or CountDownLatch allows us to continue (as the synchronized keyword does), without using the synchronized keyword? Answer 1: I think this is already guaranteed by the API. http://download.oracle.com/javase/6/docs/api/java/util/concurrent/CyclicBarrier.html Memory consistency effects: Actions in a thread prior to calling await() happen-before actions that are part of the barrier action,

Concurrent stores seen in a consistent order

Question: The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2, states: "Any two stores are seen in a consistent order by processors other than those performing the stores." But can this be so? The reason I ask is this: consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache, whereas all the logical processors
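The quoted guarantee is exactly what the IRIW (independent reads of independent writes) litmus test probes. A sketch in C11 with seq_cst atomics: run in a loop, the outcome r1==1, r2==0, r3==1, r4==0 would mean the two readers disagreed on the order of the two stores, and per sect. 8.2.2 it must never appear.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;
    int r1, r2, r3, r4;

    void *w1(void *arg)  { atomic_store(&x, 1); return NULL; }
    void *w2(void *arg)  { atomic_store(&y, 1); return NULL; }
    void *rd1(void *arg) { r1 = atomic_load(&x); r2 = atomic_load(&y); return NULL; }
    void *rd2(void *arg) { r3 = atomic_load(&y); r4 = atomic_load(&x); return NULL; }

    int main(void)
    {
        pthread_t t[4];
        void *(*fn[4])(void *) = { w1, w2, rd1, rd2 };
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, fn[i], NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
        return 0;
    }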

Is the MESI protocol enough, or are memory barriers still required? (Intel CPUs)

Question: I found an Intel document which states that memory barriers are required when string operations (not std::string, but the x86 string instructions) are used, to prevent them being reordered by the CPU. However, are memory barriers also required when two threads (on two different cores) access the same memory? The scenario I had in mind is one where a CPU that doesn't "own" the cache line writes to this memory, and the core writes to its store buffer (as opposed to its cache). A memory barrier
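The store buffer is precisely why MESI alone is not enough: coherence orders what reaches the caches, but a core's own pending store can still be invisible to the other core while it races ahead and loads. A sketch in C11 of the classic store-buffer (Dekker-style) litmus test: with the relaxed operations shown, r1==0 && r2==0 is observable on x86, and enabling the commented-out full fences (MFENCE at the machine level) rules it out.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X, Y;
    int r1, r2;

    void *t1(void *arg) {
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        /* atomic_thread_fence(memory_order_seq_cst);  -- forbids r1==r2==0 */
        r1 = atomic_load_explicit(&Y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        /* atomic_thread_fence(memory_order_seq_cst);  -- forbids r1==r2==0 */
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }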

Automatic optimisation for the CPU cache for an object's variables?

Question: Frankly, this is a continuation of my earlier question, inspired by this answer: https://stackoverflow.com/a/53262717/1479414 Let's suppose we have a class:

    public class Foo {
        private Integer x;

        public void setX(Integer x) {
            this.x = x;
        }

        public Integer getX() {
            return this.x;
        }
    }

And let us consider a very specific scenario, where just two threads interact with the x variable:

At time 1, a thread T1 is created.
At time 2, T1 sets the value: foo.setX(123);
At time 3, a thread T2 is