cpu-cache

How are cache memories shared in multicore Intel CPUs?

倖福魔咒の submitted on 2019-11-26 19:02:00
Question: I have a few questions regarding the cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, this has many repercussions when writing software for multicore/multiprocessor systems, hence asking here!) In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.), does each CPU core/processor have its own cache memory (data and program cache)? Can one processor/core access another's cache memory, …
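
One quick way to see the answer on a given machine (a minimal sketch, assuming a Linux system that exposes the standard sysfs cache-topology files) is to read which logical CPUs share each cache level of core 0; on typical Intel quad-cores this reports private L1/L2 per core and a shared L3:

    #include <stdio.h>

    int main(void) {
        char path[256], buf[256];
        for (int idx = 0; idx < 8; idx++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                     /* no more cache levels */
            if (fgets(buf, sizeof buf, f))
                printf("L%.1s shared with CPUs: ", buf);
            fclose(f);

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
            f = fopen(path, "r");
            if (f && fgets(buf, sizeof buf, f))
                printf("%s", buf);         /* e.g. "0,4" for a hyperthread pair */
            if (f) fclose(f);
        }
        return 0;
    }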

Why does the speed of memcpy() drop dramatically every 4KB?

爱⌒轻易说出口 submitted on 2019-11-26 18:49:14
Question: I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The results are as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer passed to memcpy(), increasing from 1KB to 2MB. Subfigures 2 and 3 detail the ranges 1KB-150KB and 1KB-32KB. Environment: CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz; OS: 2.6.35-22-generic #33-Ubuntu; GCC compiler flags: -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99. I guess it must be related to caches, but I can't find a …
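
A rough sketch of the kind of measurement described in the question (the buffer sizes, repetition count, and the POSIX clock_gettime() timer are illustrative choices, not the asker's exact harness): time memcpy() for sizes stepped in 4 KB increments and print MB/s, so any drop near cache-related boundaries shows up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        const size_t max = 2 * 1024 * 1024;          /* 2 MB, as in the question */
        char *src = malloc(max), *dst = malloc(max);
        memset(src, 1, max);

        for (size_t size = 4096; size <= max; size += 4096) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int rep = 0; rep < 1000; rep++)     /* repeat so the interval is measurable */
                memcpy(dst, src, size);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%6zu KB: %8.1f MB/s\n", size / 1024,
                   (double)size * 1000 / sec / (1024 * 1024));
        }
        printf("(checksum %d)\n", dst[0]);           /* keep dst live */
        free(src); free(dst);
        return 0;
    }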

Are there any modern CPUs where a cached byte store is actually slower than a word store?

左心房为你撑大大i submitted on 2019-11-26 17:52:07
Question: It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache line, too. Are some microcontrollers or low-end CPUs different, if they have cache at all? (I'm not counting word-addressable machines, or Alpha, which is byte-addressable but lacks byte …
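
A minimal sketch for testing the claim on a particular machine (the loop counts and the use of POSIX clock_gettime() are illustrative): store the same number of values into a hot cache line, once as byte stores and once as 32-bit stores, and compare the timings:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define REPS 100000000UL

    static double seconds(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    int main(void) {
        static volatile uint8_t  b[64];   /* one cache line, stays hot */
        static volatile uint32_t w[16];   /* same line size, word-sized elements */

        double t0 = seconds();
        for (unsigned long i = 0; i < REPS; i++) b[i & 63] = (uint8_t)i;
        double t1 = seconds();
        for (unsigned long i = 0; i < REPS; i++) w[i & 15] = (uint32_t)i;
        double t2 = seconds();

        printf("byte stores: %.3fs  word stores: %.3fs\n", t1 - t0, t2 - t1);
        return 0;
    }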

How can I do a CPU cache flush in x86 Windows?

陌路散爱 submitted on 2019-11-26 17:11:17
I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in the CPU cache), preferably a basic C implementation or Win32 call. Is there a known way to do this with a system call, or even something as sneaky as doing, say, a large memcpy? Intel i686 platform (P4 and up is okay as well). Gunther Piez: Fortunately, there is more than one way to explicitly flush the caches. The instruction wbinvd writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, …
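
WBINVD itself requires ring 0, so from a plain Win32 process one common fallback (a sketch, not the answer's exact code) is to evict a specific buffer with the unprivileged CLFLUSH instruction via the _mm_clflush intrinsic, which both MSVC and GCC provide:

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2, so P4 and later) */
    #include <stdlib.h>
    #include <string.h>

    /* Evict every line of one buffer from all cache levels (CLFLUSH is unprivileged). */
    static void flush_buffer(const void *p, size_t len) {
        const char *c = (const char *)p;
        for (size_t i = 0; i < len; i += 64)   /* 64-byte cache lines on modern x86 */
            _mm_clflush(c + i);
        _mm_mfence();                          /* wait for the flushes to complete */
    }

    int main(void) {
        enum { SZ = 1 << 20 };
        char *buf = malloc(SZ);
        memset(buf, 0, SZ);                    /* bring the buffer into cache */
        flush_buffer(buf, SZ);                 /* then force it back out */
        free(buf);
        return 0;
    }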

clflush to invalidate cache line via C function

a 夏天 submitted on 2019-11-26 16:45:37
I am trying to use clflush to manually evict a cache line in order to determine the cache and line sizes. I didn't find any guide on how to use that instruction; all I see are some examples that use higher-level functions for that purpose. There is a kernel function, void clflush_cache_range(void *vaddr, unsigned int size), but I still don't know what to include in my code or how to use it, and I don't know what "size" means in that function. More than that, how can I be sure that the line is evicted, so that I can verify the correctness of my code? UPDATE: Here is some initial code for what I am trying to …
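
For user-space code, a minimal sketch (assuming GCC/clang on x86 and a 64-byte line) is to use the _mm_clflush intrinsic rather than the kernel's clflush_cache_range(), and to check eviction by timing a reload before and after the flush; the flushed reload should take far more cycles:

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtsc (GCC/clang) */

    static uint64_t time_read(volatile int *p) {
        _mm_mfence();
        uint64_t t0 = __rdtsc();
        (void)*p;                        /* the load being timed */
        _mm_mfence();
        return __rdtsc() - t0;
    }

    int main(void) {
        static int x;
        x = 1;                           /* write x so its line is in cache */
        printf("cached read : %llu cycles\n", (unsigned long long)time_read(&x));

        _mm_clflush(&x);                 /* evict the line holding x */
        _mm_mfence();
        printf("flushed read: %llu cycles\n", (unsigned long long)time_read(&x));
        return 0;
    }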

Non-temporal loads and the hardware prefetcher, do they work together?

你离开我真会死。 submitted on 2019-11-26 16:41:22
When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware prefetcher still kick in, or should I use explicit software prefetching (with the NTA hint) in order to obtain the benefits of prefetching while still avoiding cache pollution? The reason I ask is that their objectives seem contradictory to me. A streaming load fetches data bypassing the cache, while the prefetcher attempts to proactively fetch data into the cache. When sequentially iterating over a large data structure (processed data won't be retouched in a long …
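
A sketch of combining the two explicitly (SSE4.1, so compile with something like the -msse4 flag used in an earlier question; the prefetch distance is an illustrative guess that would need tuning): issue an NTA-hinted software prefetch a few cache lines ahead of the streaming loads instead of relying on the hardware prefetcher:

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128; also _mm_prefetch */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sum a buffer with streaming loads; src must be 16-byte aligned. The NTA
     * prefetch runs ~16 vectors (4 lines) ahead; prefetching past the end is harmless. */
    static int64_t sum_stream(const __m128i *src, size_t n_vecs) {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < n_vecs; i++) {
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
            acc = _mm_add_epi64(acc, _mm_stream_load_si128((__m128i *)(src + i)));
        }
        int64_t out[2];
        _mm_storeu_si128((__m128i *)out, acc);
        return out[0] + out[1];
    }

    int main(void) {
        size_t bytes = 1 << 24;                      /* 16 MB, larger than the LLC share */
        void *buf = aligned_alloc(16, bytes);        /* C11 aligned allocation */
        memset(buf, 1, bytes);
        printf("sum = %lld\n", (long long)sum_stream(buf, bytes / sizeof(__m128i)));
        free(buf);
        return 0;
    }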

Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]

一笑奈何 submitted on 2019-11-26 14:22:00
This question already has an answer here: Why does the order of the loops affect performance when iterating over a 2D array? (7 answers) Which of the following orderings of nested loops for iterating over a 2D array is more efficient in terms of time (cache performance), and why? int a[100][100]; for(i=0; i<100; i++) { for(j=0; j<100; j++) { a[i][j] = 10; } } or for(i=0; i<100; i++) { for(j=0; j<100; j++) { a[j][i] = 10; } } The first method is slightly better, as the cells being assigned to lie next to each other. First method: [ASCII diagram: consecutive assignments land in adjacent cells of the same row] …
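
The two orderings from the question, written out as compilable functions with the row-major reasoning as comments (the array size and assigned value are as in the question):

    /* C arrays are row-major: a[i][j] and a[i][j+1] are adjacent in memory and
     * usually share a cache line, while a[j][i] and a[j+1][i] are 400 bytes
     * apart and each store touches a different line. */
    #include <stddef.h>

    #define N 100
    int a[N][N];

    void fill_row_major(void) {            /* fast: walks memory sequentially */
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                a[i][j] = 10;
    }

    void fill_column_major(void) {         /* slower: strides N*sizeof(int) bytes per store */
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                a[j][i] = 10;
    }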

Can I force cache coherency on a multicore x86 CPU?

对着背影说爱祢 submitted on 2019-11-26 11:56:07
Question: The other week, I wrote a little thread class and a one-way message pipe to allow communication between threads (two pipes per thread, obviously, for bidirectional communication). Everything worked fine on my Athlon 64 X2, but I was wondering if I'd run into any problems if both threads were looking at the same variable and the locally cached value of this variable on each core was out of sync. I know the volatile keyword will force a variable to be re-read from memory, but is there a way on …
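
On x86 the hardware (MESI-style coherency) already keeps the per-core caches in sync; what usually bites is compiler reordering and visibility, which volatile alone does not address. A minimal sketch of the usual language-level fix, assuming a C11 toolchain with <stdatomic.h> and <threads.h> (the names here are illustrative, not the asker's code):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    static atomic_int ready = 0;
    static int payload;

    static int producer(void *arg) {
        (void)arg;
        payload = 42;                                            /* ordinary write ...        */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* ... published by this store */
        return 0;
    }

    int main(void) {
        thrd_t t;
        thrd_create(&t, producer, NULL);
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                        /* spin until the flag is published */
        printf("payload = %d\n", payload);           /* guaranteed to print 42 */
        thrd_join(t, NULL);
        return 0;
    }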

Which cache mapping technique is used in intel core i7 processor?

一个人想着一个人 submitted on 2019-11-26 11:23:34
I have learned about different cache mapping techniques like direct mapping, associative mapping, and set-associative mapping, and also learned the trade-offs. But I am curious what is used in the Intel Core i7 or AMD processors nowadays, how the techniques have evolved, and what needs to be improved. Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate of a set-associative cache of the same size, at the cost of only a bit more complexity in the control logic. Transistor budgets are very …
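
As a worked example of the set-associative lookup itself (the figures are typical of an Intel L1d: 32 KB, 8-way, 64-byte lines, hence 64 sets; the real numbers for a given CPU come from CPUID), an address splits into offset, set index, and tag like this:

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE  64u                               /* bytes per cache line */
    #define WAYS        8u                               /* associativity */
    #define CACHE_SIZE (32u * 1024u)                     /* 32 KB L1d */
    #define SETS       (CACHE_SIZE / (LINE_SIZE * WAYS)) /* = 64 sets */

    int main(void) {
        uintptr_t addr   = 0x7ffd1234u;                  /* an example address */
        uintptr_t offset = addr % LINE_SIZE;             /* byte within the line */
        uintptr_t set    = (addr / LINE_SIZE) % SETS;    /* which set is probed */
        uintptr_t tag    = addr / (LINE_SIZE * SETS);    /* compared with the 8 ways of that set */
        printf("addr 0x%llx -> set %llu, offset %llu, tag 0x%llx\n",
               (unsigned long long)addr, (unsigned long long)set,
               (unsigned long long)offset, (unsigned long long)tag);
        return 0;
    }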

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

≯℡__Kan透↙ submitted on 2019-11-26 10:28:36
Why is the size of the L1 cache smaller than that of the L2 cache in most processors? There are different reasons for that. L2 exists in the system to speed up the case where there is an L1 cache miss. If L1 were the same size as or bigger than L2, then L2 could not accommodate more cache lines than L1 and would not be able to handle L1 cache misses. From the design/cost perspective, the L1 cache is bound to the processor and is faster than L2. The whole idea of caches is that you speed up access to slower hardware by adding intermediate hardware that is higher-performing …
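
A back-of-the-envelope sketch of why a small, fast L1 in front of a larger, slower L2 pays off, using the standard average-memory-access-time formula with illustrative latencies and hit rates (real figures vary by microarchitecture):

    #include <stdio.h>

    int main(void) {
        double l1_lat = 4, l2_lat = 12, mem_lat = 200;   /* cycles, illustrative */
        double l1_hit = 0.95, l2_hit = 0.90;             /* hit rates, illustrative */

        /* AMAT = L1 latency + L1 miss rate * (L2 latency + L2 miss rate * memory latency) */
        double amat = l1_lat + (1 - l1_hit) * (l2_lat + (1 - l2_hit) * mem_lat);
        printf("AMAT = %.1f cycles\n", amat);            /* ~5.6 cycles vs ~200 uncached */
        return 0;
    }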