cpu-cache

Speed up random memory access using prefetch

旧街凉风 submitted on 2019-12-25 07:34:54
Question: I am trying to speed up a single program by using prefetches. The purpose of my program is just a test. Here is what it does: It uses two int buffers of the same size. It reads one-by-one all the values of the first buffer. It reads the value at that index in the second buffer. It sums all the values taken from the second buffer. It does all the previous steps for bigger and bigger… At the end, I print the number of voluntary and involuntary CPU… The very first time, the values in the first buffer…
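For reference, a minimal sketch of the access pattern the question describes, assuming GCC/Clang (for `__builtin_prefetch`); the function name, buffer names, and the kAhead prefetch distance are all made up for illustration:

```cpp
#include <cstdint>
#include <cstddef>

// Walk an index buffer sequentially and prefetch the randomly-indexed
// second buffer a few iterations ahead, so the miss latency overlaps
// with useful work on earlier elements.
int64_t sum_indirect(const int* idx, const int* data, size_t n) {
    constexpr size_t kAhead = 16;   // prefetch distance: tune per machine
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[idx[i + kAhead]], /*rw=*/0, /*locality=*/1);
        sum += data[idx[i]];        // the random access the prefetch hides
    }
    return sum;
}
```

The prefetch distance has to be tuned: too small and the line has not arrived by the time it is needed; too large and it may be evicted again before use.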

If there is a persistent cache miss, will the workload (usage) of the CPU core be 100%?

别等时光非礼了梦想. submitted on 2019-12-25 05:00:31
Question: That is, if the processor core spends most of its time waiting for data from RAM or the L3 cache because of cache misses, but the system is real-time (real-time thread priority) and the thread is pinned (affinity) to the core and runs without thread/context switching, what load (usage) should the CPU core show on modern x86_64? That is, is CPU usage shown as decreasing only when the core enters Idle? And if anyone knows whether the behavior in this case differs for other processors: ARM, Power[PC], …
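For concreteness, a hedged sketch of the setup the question assumes, for Linux built with g++ (which defines _GNU_SOURCE, needed for the affinity API); the function name and priority value are arbitrary, error checks are omitted, and the real-time priority needs root or CAP_SYS_NICE:

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core and give it a real-time
// (SCHED_FIFO) priority, as described in the question.
void pin_and_make_realtime(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                  // restrict this thread to `core`
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    sched_param prio{};
    prio.sched_priority = 80;             // any valid SCHED_FIFO priority (1..99)
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &prio);
}
```

As for the question itself: as far as I know, Linux time accounting treats a thread that is stalled on memory as running, so tools like top report the core as 100% busy; reported usage only drops when the core is actually scheduled into the idle task.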

What is a PDE cache?

天大地大妈咪最大 submitted on 2019-12-23 19:24:37
Question: I have the following specifications for an ARM-based SoC: L1 data cache = 32 KB, 64 B/line, 2-way, LRU; L2 cache = 1 MB, 64 B/line, 16-way; L1 data TLB (for loads): 32 entries, fully associative; L2 data TLB: 512 entries, 4-way; PDE cache: 16 entries (one entry per 1 MB of virtual space). And I wonder: what is the PDE cache? I guess it's something similar to a TLB, but I'm not sure. Answer: It seems that the PDE (Page Directory Entry) cache is an intermediate table-walk cache, which indeed can be implemented…

Does clflush also remove TLB entries?

非 Y 不嫁゛ submitted on 2019-12-23 18:57:16
Question: Does clflush [1] also flush the associated TLB entries? I would assume not, since clflush operates at cache-line granularity while TLB entries exist at the (much larger) page granularity, but I am prepared to be surprised. [1] …or clflushopt, although one would reasonably assume their behaviors are the same. Answer 1: I think it's safe to assume no; baking invlpg into clflush sounds like an insane design decision that I don't think anyone would make. You often want to invalidate multiple lines in a…
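To illustrate the "invalidate multiple lines" case, a minimal sketch assuming a 64-byte line size and SSE2 intrinsics (the helper name is made up): flushing a buffer line by line with `_mm_clflush` touches only the data caches, and invlpg is a privileged instruction in any case, so user space could not combine the two even if it wanted to.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Flush every cache line backing [p, p + bytes) from all cache levels.
// TLB entries for the containing pages are untouched.
void flush_buffer(const void* p, size_t bytes) {
    constexpr uintptr_t kLine = 64;                 // assumed line size
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    for (uintptr_t a = addr & ~(kLine - 1); a < addr + bytes; a += kLine)
        _mm_clflush(reinterpret_cast<const void*>(a));
    _mm_mfence();   // order the flushes against later loads/stores
}
```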

Getting cache details in ARM processors - Linux

亡梦爱人 submitted on 2019-12-23 03:49:07
Question: On Intel processors (Linux linux-epq2.site 3.7.10-1.11-desktop #1 SMP PREEMPT Thu May 16 20:27:27 UTC 2013 (adf31bb) x86_64 x86_64 x86_64 GNU/Linux), the cache details can be fetched with: cat /sys/devices/system/cpu/cpu*/cache/index*/* where * stands for the respective CPU and cache index numbers. However, on ARM processors this file/folder is not available. Is there a way to fetch these details? Linux arndale 3.9.0-rc5+ #8 SMP Tue Apr 9 12:40:32 CEST 2013 armv7l GNU/Linux Answer 1: From ARMv8-A (64-bit) onward, it is possible…
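Where the kernel does expose cacheinfo (newer ARM kernels do, per the answer), the same sysfs files can also be read programmatically. A minimal sketch, assuming cpu0/index0 exists on the target; the file names are the standard Linux cacheinfo entries:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Print a few well-known cacheinfo attributes for the first
    // cache (index0) of the first CPU (cpu0).
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/index0/";
    for (const char* f : {"level", "type", "size", "coherency_line_size"}) {
        std::ifstream in(base + f);
        std::string value;
        if (std::getline(in, value))
            std::cout << f << ": " << value << '\n';
    }
}
```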

Intel PMU event for L1 cache hit event

丶灬走出姿态 submitted on 2019-12-22 15:37:13
Question: I'm trying to count the number of cache hits at the different cache levels (L1, L2, and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: Which Event Number and UMASK…
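On Haswell the dedicated L1 hit event is, as far as I can tell, MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01). As a portable fallback while hunting for the exact encoding, the kernel's generic cache events can count L1D accesses and misses without hard-coding it; a hedged perf_event_open sketch (error handling omitted, helper name made up):

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Open a generic hardware-cache counter for the calling thread.
static int open_counter(uint64_t config) {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return syscall(SYS_perf_event_open, &attr, 0 /*this thread*/, -1, -1, 0);
}

int main() {
    // L1D | read | accesses (for misses, use RESULT_MISS instead).
    uint64_t cfg = PERF_COUNT_HW_CACHE_L1D |
                   (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                   (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
    int fd = open_counter(cfg);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... workload under measurement ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("L1D read accesses: %lld\n", count);
}
```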

Cache-friendly way to collect results from multiple threads

心不动则不痛 submitted on 2019-12-21 10:46:30
Question: Consider N threads doing some asynchronous tasks with a small result value like double or int64_t, so about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores. On the one hand, if I just allocate an array of N items, each a double or int64_t, then 8 threads will share a CPU cache line, which seems inefficient. On the other hand, if I allocate a whole cache line for each double/int64_t, the receiver thread will have to fetch N cache lines, each written by a…
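A minimal sketch of the second option being weighed, assuming a 64-byte line and C++17 (the type and function names are made up): each worker owns a cache-line-sized slot, so writers never share a line (no false sharing), while the collector pays one line fetch per thread.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One result slot per worker, padded out to a full cache line.
struct alignas(64) PaddedResult {
    std::atomic<int64_t> value{0};
};

// The receiver thread sums the slots; it touches N distinct lines.
int64_t collect(const std::vector<PaddedResult>& slots) {
    int64_t total = 0;
    for (const auto& s : slots)
        total += s.value.load(std::memory_order_acquire);
    return total;
}
```

C++17's std::hardware_destructive_interference_size (in &lt;new&gt;) is the portable stand-in for the hard-coded 64.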

When L1 misses differ a lot from L2 accesses… TLB related?

孤街浪徒 submitted on 2019-12-21 04:42:24
Question: I have been running benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing to me. Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can think of would be TLB-related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels.

Optimising Java objects for CPU cache line efficiency

余生长醉 submitted on 2019-12-20 18:27:51
Question: I'm writing a library where: It will need to run on a wide range of different platforms/Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64-bit machines with Windows or Linux). Achieving high performance is a priority, to the extent that I care about CPU cache-line efficiency in object access. In some areas, quite large graphs of small objects will be traversed/processed (say, around the 1 GB scale). The main workload is almost exclusively reads. Reads…
