cpu-cache | 易学教程

What happens with a non-temporal store if the data is already in cache?

Question: When you use non-temporal stores, e.g. movntq, and the data is already in cache, will the store update the cache instead of writing out to memory? Or will it update the cache line and write it out, evicting it? Or something else? Here's a fun dilemma. Suppose thread A is loading the cache line containing x and y. Thread B writes to x using an NT store. Thread A writes to y. There's a data race here if B's store to x can be in transit to memory while A's load is happening. If A sees the old value of x,

How are the modern Intel CPU L3 caches organized?

Question: Given that CPUs are now multi-core and each core has its own L1/L2 caches, I was curious how the L3 cache is organized, given that it's shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages' worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages. This way when multiple

How do cores decide which cache line to invalidate in MESI?

Question: I have a misunderstanding about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens.

    mov [addr], dword 0xAC763F ;starting Thread 1 and Thread 2

Now let's say the threads perform the following actions in parallel:

    Thread 1                  Thread 2
    mov rax, [addr]           mov rax, [addr]
    mov [addr], dword 1       mov [addr], dword 2

My understanding of what's going on is this: before starting, the main thread writes to the corresponding cache line

Would buffering cache changes prevent Meltdown?

Question: If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible? The proposal is to let speculative execution load from memory, but not write to the CPU caches until the instructions are actually committed.

Answer 1: TL;DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register),

CPU cache: does the distance between two addresses need to be smaller than 8 bytes to have a cache advantage?

Question: It may seem a weird question. Say a cache line's size is 64 bytes. Further, assume that L1, L2 and L3 have the same cache line size (this post said it's the case for Intel Core i7). There are two objects A, B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache-line boundary, that is, its address is an integer multiple of 64. 1) If N < 64, when A is fetched by the CPU, B will be read into the cache, too. So if B is needed, and the cache line is not
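The premise of case 1) can be checked with simple address arithmetic. The following is a minimal sketch, not part of the original question: the helper names are hypothetical, the 64-byte line size is the question's own assumption, and only the first byte of each object is considered (an object that straddles a boundary would touch two lines).

```python
LINE_SIZE = 64  # bytes; assumed, matching the premise of the question


def cache_line_index(address, line_size=LINE_SIZE):
    """Return the index of the cache line that contains this byte address."""
    return address // line_size


def same_cache_line(addr_a, addr_b, line_size=LINE_SIZE):
    """True if both byte addresses fall within the same cache line."""
    return cache_line_index(addr_a, line_size) == cache_line_index(addr_b, line_size)


# A is line-aligned (an integer multiple of 64), as the question assumes.
a = 128
print(same_cache_line(a, a + 63))  # N = 63 < 64: B shares A's line
print(same_cache_line(a, a + 64))  # N = 64: B starts the next line
# Without the alignment assumption, even a small N can cross a boundary.
print(same_cache_line(120, 128))
```

The last call illustrates why the alignment assumption matters: what decides whether B arrives with A is whether both addresses map to the same line index, not the distance N alone.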

Can't sample hardware cache events with linux perf

Question: For some reason, I can't sample (perf record) hardware cache events:

    # perf record -e L1-dcache-stores -a -c 100 -- sleep 5
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.607 MB perf.data (~26517 samples) ]
    # perf script

but I can count them (perf stat):

    # perf stat -e L1-dcache-stores -a -- sleep 5
    Performance counter stats for 'sleep 5':
    711,781 L1-dcache-stores
    5.000842990 seconds time elapsed

I tried on different CPUs, OS versions (and kernel

Getting cache details in ARM processors - Linux

On Intel processors:

    Linux linux-epq2.site 3.7.10-1.11-desktop #1 SMP PREEMPT Thu May 16 20:27:27 UTC 2013 (adf31bb) x86_64 x86_64 x86_64 GNU/Linux

the cache details can be fetched with:

    cat /sys/devices/system/cpu/cpu*/cache/index*/

where each * stands for the respective CPU and cache index number. However, on ARM processors, this file/folder is not available. Is there a way to fetch these details?

    Linux arndale 3.9.0-rc5+ #8 SMP Tue Apr 9 12:40:32 CEST 2013 armv7l GNU/Linux

From ARMv8-A (64-bit), it is possible to get cache info from the CLIDR register, so cache info can be populated into the /sys file system in Linux. Check
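On systems where the sysfs cache directory quoted above does exist, its per-index attribute files can be read programmatically. The following is a minimal sketch, not taken from the original post: the function name and the choice of fields are illustrative assumptions, and any attribute file that is missing (as on the ARM board described) is reported as None rather than raising an error.

```python
import glob
import os


def read_cache_info(sysfs_root="/sys/devices/system/cpu"):
    """Collect cache attributes for every cpuN/cache/indexM directory.

    Returns a list of dicts; an attribute is None if its file is absent.
    """
    fields = ("level", "type", "size", "coherency_line_size")
    # cpu[0-9]* skips sibling directories such as cpufreq and cpuidle.
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index[0-9]*")
    results = []
    for index_dir in sorted(glob.glob(pattern)):
        entry = {"path": index_dir}
        for name in fields:
            try:
                with open(os.path.join(index_dir, name)) as f:
                    entry[name] = f.read().strip()
            except OSError:
                entry[name] = None
        results.append(entry)
    return results


# Usage: prints one dict per cache index, or nothing if the tree is absent.
for cache in read_cache_info():
    print(cache)
```

An empty result means the kernel does not expose the directory at all, which is the situation the question reports for the armv7l system.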

Skylake L2 cache enhanced by reducing associativity?

Question: In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine):

The cache hierarchy of the Skylake microarchitecture has the following enhancements:
- Higher Cache bandwidth compared to previous generations.
- Simultaneous handling of more loads and stores enabled by enlarged buffers.
- Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.
- Page split load

CPU measures (Cache misses/hits) which do not make sense

Question: I use Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure the cache efficiency. Basically, I first put a small array into the L1 cache (by traversing it many times), then I start the timer, go over the array one more time (which hopefully uses the cache), and then stop the timer. PCM shows me that I have a rather high L2 and L3 miss ratio. I also checked with rdtscp, and the cycle count per array operation is 15 (which is much higher than 4-5 cycles for

What is the cache line size on iPhone and iPad?

Question: What is the cache line size on iPhone and iPad? And does it vary much between the different devices and CPUs? This is not too easy to find with Google. I need to squeeze some extra performance from my app. :)

Answer 1: Well, the Cortex-A8 has 64-byte lines and the Cortex-A9 has 32-byte lines; as for Swift and Cyclone, I don't know. Looking at comparable cores (A15, A57, Scorpion, Krait), 32 or 64 bytes seems likely. Either way, there are at least 2 different lengths across iOS 7 machines. As you're