cpu-cache

Speed up random memory access using prefetch

旧街凉风 submitted on 2019-12-25 07:34:54
Question: I am trying to speed up a single program by using prefetches. The purpose of my program is just a test. Here is what it does: It uses two int buffers of the same size. It reads one-by-one all the values of the first buffer. It reads the value at that index in the second buffer. It sums all the values taken from the second buffer. It does all the previous steps for bigger and bigger… At the end, I print the number of voluntary and involuntary CPU… The very first time, the values in the first buffer…
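For reference, a minimal sketch of the access pattern the question describes, assuming GCC/Clang (for `__builtin_prefetch`); the function name, buffer names, and the kAhead prefetch distance are all made up for illustration:

```cpp
#include <cstdint>
#include <cstddef>

// Walk an index buffer sequentially and prefetch the randomly-indexed
// second buffer a few iterations ahead, so the miss latency overlaps
// with useful work on earlier elements.
int64_t sum_indirect(const int* idx, const int* data, size_t n) {
    constexpr size_t kAhead = 16;   // prefetch distance: tune per machine
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[idx[i + kAhead]], /*rw=*/0, /*locality=*/1);
        sum += data[idx[i]];        // the random access the prefetch hides
    }
    return sum;
}
```

The prefetch distance has to be tuned: too small and the line has not arrived by the time it is needed; too large and it may be evicted again before use.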

If there is a persistent cache miss, will the workload (usage) of the CPU core be 100%?

别等时光非礼了梦想. submitted on 2019-12-25 05:00:31
Question: That is, if the processor core spends most of its time waiting for data from RAM or the L3 cache because of cache misses, but the system is real-time (real-time thread priority) and the thread is pinned (affinity) to the core and runs without thread/context switching, what load (usage) should the CPU core show on modern x86_64? That is, is CPU usage shown as decreasing only when the core enters Idle? And if anyone knows whether the behavior in this case differs for other processors: ARM, Power[PC], …
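For concreteness, a hedged sketch of the setup the question assumes, for Linux built with g++ (which defines _GNU_SOURCE, needed for the affinity API); the function name and priority value are arbitrary, error checks are omitted, and the real-time priority needs root or CAP_SYS_NICE:

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core and give it a real-time
// (SCHED_FIFO) priority, as described in the question.
void pin_and_make_realtime(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                  // restrict this thread to `core`
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    sched_param prio{};
    prio.sched_priority = 80;             // any valid SCHED_FIFO priority (1..99)
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &prio);
}
```

As for the question itself: as far as I know, Linux time accounting treats a thread that is stalled on memory as running, so tools like top report the core as 100% busy; reported usage only drops when the core is actually scheduled into the idle task.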

What is a PDE cache?

天大地大妈咪最大 submitted on 2019-12-23 19:24:37
Question: I have the following specifications for an ARM-based SoC: L1 data cache = 32 KB, 64 B/line, 2-way, LRU; L2 cache = 1 MB, 64 B/line, 16-way; L1 data TLB (for loads): 32 entries, fully associative; L2 data TLB: 512 entries, 4-way; PDE cache: 16 entries (one entry per 1 MB of virtual space). And I wonder: what is the PDE cache? I guess it's something similar to a TLB, but I'm not sure. Answer: It seems that the PDE (Page Directory Entry) cache is an intermediate table-walk cache, which indeed can be implemented…

Does clflush also remove TLB entries?

非 Y 不嫁゛ submitted on 2019-12-23 18:57:16
Question: Does clflush [1] also flush the associated TLB entries? I would assume not, since clflush operates at cache-line granularity while TLB entries exist at the (much larger) page granularity, but I am prepared to be surprised. [1] …or clflushopt, although one would reasonably assume their behaviors are the same. Answer 1: I think it's safe to assume no; baking invlpg into clflush sounds like an insane design decision that I don't think anyone would make. You often want to invalidate multiple lines in a…
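To illustrate the "invalidate multiple lines" case, a minimal sketch assuming a 64-byte line size and SSE2 intrinsics (the helper name is made up): flushing a buffer line by line with `_mm_clflush` touches only the data caches, and invlpg is a privileged instruction in any case, so user space could not combine the two even if it wanted to.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Flush every cache line backing [p, p + bytes) from all cache levels.
// TLB entries for the containing pages are untouched.
void flush_buffer(const void* p, size_t bytes) {
    constexpr uintptr_t kLine = 64;                 // assumed line size
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    for (uintptr_t a = addr & ~(kLine - 1); a < addr + bytes; a += kLine)
        _mm_clflush(reinterpret_cast<const void*>(a));
    _mm_mfence();   // order the flushes against later loads/stores
}
```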

Getting cache details in ARM processors - Linux

亡梦爱人 submitted on 2019-12-23 03:49:07
Question: On Intel processors (Linux linux-epq2.site 3.7.10-1.11-desktop #1 SMP PREEMPT Thu May 16 20:27:27 UTC 2013 (adf31bb) x86_64 x86_64 x86_64 GNU/Linux), the cache details can be fetched with: cat /sys/devices/system/cpu/cpu*/cache/index*/* where * stands for the respective CPU and cache index numbers. However, on ARM processors this file/folder is not available. Is there a way to fetch these details? Linux arndale 3.9.0-rc5+ #8 SMP Tue Apr 9 12:40:32 CEST 2013 armv7l GNU/Linux Answer 1: From ARMv8-A (64-bit) onward, it is possible…
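Where the kernel does expose cacheinfo (newer ARM kernels do, per the answer), the same sysfs files can also be read programmatically. A minimal sketch, assuming cpu0/index0 exists on the target; the file names are the standard Linux cacheinfo entries:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Print a few well-known cacheinfo attributes for the first
    // cache (index0) of the first CPU (cpu0).
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/index0/";
    for (const char* f : {"level", "type", "size", "coherency_line_size"}) {
        std::ifstream in(base + f);
        std::string value;
        if (std::getline(in, value))
            std::cout << f << ": " << value << '\n';
    }
}
```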

Intel PMU event for L1 cache hit event

丶灬走出姿态 submitted on 2019-12-22 15:37:13
Question: I'm trying to count the number of cache hits at the different cache levels (L1, L2, and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: Which Event Number and UMASK…
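On Haswell the dedicated L1 hit event is, as far as I can tell, MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01). As a portable fallback while hunting for the exact encoding, the kernel's generic cache events can count L1D accesses and misses without hard-coding it; a hedged perf_event_open sketch (error handling omitted, helper name made up):

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Open a generic hardware-cache counter for the calling thread.
static int open_counter(uint64_t config) {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return syscall(SYS_perf_event_open, &attr, 0 /*this thread*/, -1, -1, 0);
}

int main() {
    // L1D | read | accesses (for misses, use RESULT_MISS instead).
    uint64_t cfg = PERF_COUNT_HW_CACHE_L1D |
                   (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                   (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
    int fd = open_counter(cfg);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... workload under measurement ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("L1D read accesses: %lld\n", count);
}
```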

Cache-friendly way to collect results from multiple threads

心不动则不痛 submitted on 2019-12-21 10:46:30
Question: Consider N threads doing some asynchronous tasks with a small result value like double or int64_t, so about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores. On the one hand, if I just allocate an array of N items, each a double or int64_t, then 8 threads will share a CPU cache line, which seems inefficient. On the other hand, if I allocate a whole cache line for each double/int64_t, the receiver thread will have to fetch N cache lines, each written by a…
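A minimal sketch of the second option being weighed, assuming a 64-byte line and C++17 (the type and function names are made up): each worker owns a cache-line-sized slot, so writers never share a line (no false sharing), while the collector pays one line fetch per thread.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One result slot per worker, padded out to a full cache line.
struct alignas(64) PaddedResult {
    std::atomic<int64_t> value{0};
};

// The receiver thread sums the slots; it touches N distinct lines.
int64_t collect(const std::vector<PaddedResult>& slots) {
    int64_t total = 0;
    for (const auto& s : slots)
        total += s.value.load(std::memory_order_acquire);
    return total;
}
```

C++17's std::hardware_destructive_interference_size (in &lt;new&gt;) is the portable stand-in for the hard-coded 64.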

When L1 misses differ a lot from L2 accesses… TLB related?

孤街浪徒 submitted on 2019-12-21 04:42:24
Question: I have been running benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing to me. Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can think of would be TLB-related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels.

Optimising Java objects for CPU cache line efficiency

余生长醉 submitted on 2019-12-20 18:27:51
Question: I'm writing a library where: It will need to run on a wide range of different platforms/Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64-bit machines with Windows or Linux). Achieving high performance is a priority, to the extent that I care about CPU cache-line efficiency in object access. In some areas, quite large graphs of small objects will be traversed/processed (say, around the 1 GB scale). The main workload is almost exclusively reads. Reads…
