cpu-cache

Is it necessary for the programmer to flush write-combining memory explicitly?

Question: I know that write-combining (WC) writes are buffered and don't reach memory directly. But is it necessary for the programmer to flush this memory explicitly before others can access it? I got this question from graphics driver code. For example, the CPU fills a vertex buffer (mapped as WC), but before the GPU accesses it, I don't see any flush operation in the code. Does the architecture (x86) already take care of this for us? Is there any more detailed documentation about this? Answer 1: According to the Intel® 64 and IA-32 Architectures Software Developer's Manual
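A minimal sketch of the pattern under discussion, assuming a WC-mapped buffer and SSE (the function and buffer names are hypothetical): WC stores accumulate in the CPU's write-combining buffers, and an explicit SFENCE drains them before another agent reads the memory. Drivers often get this draining for free from serializing or locked instructions on their command-submission path, which is why no visible flush may appear in the code.

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_sfence (SSE) */

    /* Hypothetical driver helper: fill a WC-mapped vertex buffer, then
       drain the write-combining buffers so the GPU sees the data. */
    void fill_vertex_buffer(volatile float *wc_buf, const float *verts, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            wc_buf[i] = verts[i];   /* stores accumulate in WC buffers */
        _mm_sfence();               /* flush WC buffers before GPU access */
    }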

How to abandon (invalidate without saving) a cache line on x86_64?

Question: As I understand it, _mm_clflush() / _mm_clflushopt() invalidates a cache line, writing it back to memory if it has been modified. Is there a way to simply abandon a cache line, without saving any changes made to it to memory? A use case is before freeing memory: I don't need the cache lines or their values anymore. Source: https://stackoverflow.com/questions/45987746/how-to-abandon-invalidate-without-saving-a-cache-line-on-x86-64
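For context, x86 offers no unprivileged discard-without-writeback: INVD is privileged and drops entire caches, not a single line. The closest user-mode option remains a write-back-and-invalidate per line; a minimal sketch, assuming 64-byte cache lines and CLFLUSHOPT support:

    #include <stddef.h>
    #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence */

    /* Write back and invalidate every line in [p, p+len); dirty data is
       still written to memory -- x86 has no user-mode way to avoid that. */
    void flush_range(const void *p, size_t len)
    {
        const char *c = (const char *)p;
        for (size_t off = 0; off < len; off += 64)   /* assumed line size */
            _mm_clflushopt(c + off);
        _mm_sfence();   /* order the flushes against later stores */
    }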

Machine code alignment

Question: I am trying to understand the principles of machine-code alignment. I have an assembler implementation which can generate machine code at run time. I use 16-byte alignment on every branch destination, but it looks like this is not the optimal choice, since I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with the cache-line width, so that some instructions are cut by a cache line and the CPU experiences stalls because of that. So if some bytes of
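A minimal sketch of the padding step such an emitter typically performs (the Emitter struct and its fields are hypothetical): before emitting a branch target, pad the code buffer with NOPs until the current emit address hits the chosen boundary. Real emitters usually prefer a few multi-byte NOPs over many single-byte ones, since a long run of 0x90s is itself extra work to decode, which is one reason unconditional alignment can hurt.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint8_t *buf; size_t len; } Emitter;  /* hypothetical */

    /* Pad with single-byte NOPs (0x90) so the next instruction emitted,
       e.g. a branch destination, starts on an `alignment`-byte boundary. */
    static void align_next_instruction(Emitter *e, size_t alignment)
    {
        while (((uintptr_t)(e->buf + e->len)) % alignment != 0)
            e->buf[e->len++] = 0x90;   /* x86 one-byte NOP */
    }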

How far should one trust hardware counter profiling using VsPerfCmd.exe?

Question: I'm attempting to use VsPerfCmd.exe to profile branch mispredictions and last-level cache misses in an instrumented native application. The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that always touches a 24 MB data set is reported to cause only ~700 cache misses when called ~2000 times. Now let me put this into perspective: the function linearly traverses two arrays of 1024*1024 12-byte elements. For every
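A quick back-of-envelope check of that number, assuming 64-byte cache lines and no reuse between calls (order-of-magnitude estimates only):

    #include <stdio.h>

    int main(void)
    {
        long bytes = 2L * 1024 * 1024 * 12;  /* two arrays -> 24 MiB      */
        long lines = bytes / 64;             /* 64-byte lines per pass    */
        long calls = 2000;
        printf("%ld bytes -> ~%ld misses/call, ~%ld over %ld calls\n",
               bytes, lines, lines * calls, calls);
        /* ~393,216 expected misses per call vs. ~700 reported in total:
           a gap of roughly six orders of magnitude, which points at the
           counter configuration or attribution rather than the code. */
        return 0;
    }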

How do I map a memory address to a block when there is an offset in a direct-mapped cache?

Question: To start off, the first cache has 16 one-word blocks. As an example I will use the memory reference 0x03. The index has 4 bits (0011). It is clear that this equals 3 mod 16 (0011 = 0x03 = 3). However, I get confused using this mod equation to determine the block location in a cache with offset bits. The second cache has a total size of eight two-word blocks. This means that there is 1 offset bit. Since there are now 8 blocks, there are only 3 index bits. As an example, I will take the same
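A minimal sketch of the decomposition for the second cache, using word addresses as the question does (1 word-offset bit, 3 index bits): the mod equation still works, but it applies to the block address, i.e. the reference with the offset bits shifted away.

    #include <stdio.h>

    int main(void)
    {
        unsigned addr   = 0x03;              /* word address 0b0011      */
        unsigned offset = addr & 0x1;        /* 1 bit: word within block */
        unsigned index  = (addr >> 1) & 0x7; /* 3 bits: (addr>>1) mod 8  */
        unsigned tag    = addr >> 4;         /* remaining bits           */
        printf("offset=%u index=%u tag=%u\n", offset, index, tag);
        /* 0x03 -> offset 1, index 1: block address 1, and 1 mod 8 = 1. */
        return 0;
    }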

Flush/invalidate a range by virtual address (ARMv8, cache)

Question: I'm implementing cache-maintenance functions for an ARMv8 (Cortex-A53) running in 32-bit mode. There is a problem when I try to flush a memory region by virtual address (VA). DCacheFlushByRange looks like this:

    // some init.
    // kDCacheL1 = 0; kDCacheL2 = 2;
    while (alignedVirtAddr < endAddr) {
        // Flushing L1
        asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r"(kDCacheL1) :); // select cache
        isb();
        asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r"(alignedVirtAddr) :); // clean & invalidate
        dsb()
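For comparison, a minimal sketch of a by-VA clean-and-invalidate loop (assumptions: 64-byte line size on the Cortex-A53, AArch32 encodings): DCCIMVAC (mcr p15, 0, Rt, c7, c14, 1) operates by virtual address to the point of coherency, which already covers both L1 and L2, so the CSSELR cache-select write in the question (which only affects what CCSIDR reads report) is not needed for by-VA maintenance.

    #include <stdint.h>

    /* Clean & invalidate [start, end) by VA to the point of coherency. */
    void dcache_clean_inval_range(uintptr_t start, uintptr_t end)
    {
        const uintptr_t line = 64;               /* assumed line size */
        uintptr_t addr = start & ~(line - 1);    /* align down        */
        for (; addr < end; addr += line)
            asm volatile("mcr p15, 0, %0, c7, c14, 1"   /* DCCIMVAC */
                         : : "r"(addr) : "memory");
        asm volatile("dsb" ::: "memory");  /* complete before returning */
    }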

Cache flush on CyclicBarrier or CountDownLatch, as with the synchronized keyword

Question: Is there a way to ensure that Java flushes the writes made before a CyclicBarrier or CountDownLatch allows us to continue (as the synchronized keyword does), without using the synchronized keyword? Answer 1: I think this is already guaranteed by the API. http://download.oracle.com/javase/6/docs/api/java/util/concurrent/CyclicBarrier.html Memory consistency effects: Actions in a thread prior to calling await() happen-before actions that are part of the barrier action,

Concurrent stores seen in a consistent order

Question: The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2, states: "Any two stores are seen in a consistent order by processors other than those performing the stores." But can this be so? The reason I ask is this: consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache, whereas all the logical processors
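The quoted guarantee is exactly what the IRIW (independent reads of independent writes) litmus test probes. A sketch in C11 with seq_cst atomics: run in a loop, the outcome r1==1, r2==0, r3==1, r4==0 would mean the two readers disagreed on the order of the two stores, and per sect. 8.2.2 it must never appear.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;
    int r1, r2, r3, r4;

    void *w1(void *arg)  { atomic_store(&x, 1); return NULL; }
    void *w2(void *arg)  { atomic_store(&y, 1); return NULL; }
    void *rd1(void *arg) { r1 = atomic_load(&x); r2 = atomic_load(&y); return NULL; }
    void *rd2(void *arg) { r3 = atomic_load(&y); r4 = atomic_load(&x); return NULL; }

    int main(void)
    {
        pthread_t t[4];
        void *(*fn[4])(void *) = { w1, w2, rd1, rd2 };
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, fn[i], NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
        return 0;
    }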

Is the MESI protocol enough, or are memory barriers still required? (Intel CPUs)

Question: I found an Intel document which states that memory barriers are required when string operations (not std::string, but the x86 string instructions) are used, to prevent them being reordered by the CPU. However, are memory barriers also required when two threads (on two different cores) access the same memory? The scenario I had in mind is one where a CPU that doesn't "own" the cache line writes to this memory, and the core writes to its store buffer (as opposed to its cache). A memory barrier
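The store buffer is precisely why MESI alone is not enough: coherence orders what reaches the caches, but a core's own pending store can still be invisible to the other core while it races ahead and loads. A sketch in C11 of the classic store-buffer (Dekker-style) litmus test: with the relaxed operations shown, r1==0 && r2==0 is observable on x86, and enabling the commented-out full fences (MFENCE at the machine level) rules it out.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X, Y;
    int r1, r2;

    void *t1(void *arg) {
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        /* atomic_thread_fence(memory_order_seq_cst);  -- forbids r1==r2==0 */
        r1 = atomic_load_explicit(&Y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        /* atomic_thread_fence(memory_order_seq_cst);  -- forbids r1==r2==0 */
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }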

Automatic optimisation for the CPU cache for an object's variables?

Question: Frankly, this is a continuation of my earlier question, inspired by this answer: https://stackoverflow.com/a/53262717/1479414 Let's suppose we have a class:

    public class Foo {
        private Integer x;

        public void setX(Integer x) {
            this.x = x;
        }

        public Integer getX() {
            return this.x;
        }
    }

And let us consider a very specific scenario, where just two threads interact with the x variable:

At time 1, a thread T1 is created.
At time 2, T1 sets the value: foo.setX(123);
At time 3, a thread T2 is