cpu-cache

Interconnect between per-core L2 and L3 in Core i7

柔情痞子 submitted on 2019-12-01 06:39:29
The Intel Core i7 has per-core L1 and L2 caches and a large shared L3 cache. I need to know what kind of interconnect connects the multiple L2s to the single L3. I am a student and need to write a rough behavioral model of the cache subsystem. Is it a crossbar? A single bus? A ring? The references I came across mention structural details of the caches, but none of them mention what kind of on-chip interconnect exists. Thanks, -neha

Modern i7s use a ring. From Tom's Hardware: Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who
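
For a rough behavioral model, the ring can be approximated as a set of stops (one per core/L2 plus the L3 slices) where a request hops from stop to stop, paying a fixed latency per hop. A minimal C++ sketch along those lines, with hypothetical names and an assumed one-cycle-per-hop cost (not real i7 numbers):

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>

    // Hypothetical behavioral model: a request travels around a
    // bidirectional ring from its source stop to the target L3 slice.
    struct RingModel {
        int stops;          // ring stops: cores + L3 slices (assumed)
        int cycles_per_hop; // latency added per hop (assumed)

        // Shortest distance around a bidirectional ring.
        int hops(int from, int to) const {
            int d = std::abs(from - to);
            return std::min(d, stops - d);
        }

        int latency(int from, int to) const {
            return hops(from, to) * cycles_per_hop;
        }
    };

    int main() {
        RingModel ring{8, 1};  // 8 stops, 1 cycle per hop (illustrative)
        std::cout << "L2 at stop 0 -> L3 slice at stop 5: "
                  << ring.latency(0, 5) << " cycles\n";
    }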

What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE? [duplicate]

两盒软妹~` submitted on 2019-12-01 05:34:19
Question: This question already has answers here: Are loads and stores the only instructions that gets reordered? (2 answers) Which is a better write barrier on x86: lock+addl or xchgl? (5 answers) Does lock xchg have the same behavior as mfence? (1 answer) Closed last year. What is the difference in logic and performance between the x86 instructions LOCK XCHG and MOV+MFENCE for doing a sequential-consistency store? (We ignore the load result of the XCHG; compilers other than gcc use it for the store +
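
For context, both instruction sequences are ways a compiler can implement a C++ sequentially consistent store; which one you get is a code-generation choice. A minimal sketch (actual output varies by compiler and version):

    #include <atomic>

    std::atomic<int> x{0};

    void seq_cst_store(int v) {
        // One sequentially consistent store. GCC has traditionally
        // emitted "mov [x], v; mfence" here, while other compilers emit
        // "xchg [x], v" and simply ignore the value the exchange loads.
        // Both order the store before any later loads.
        x.store(v, std::memory_order_seq_cst);
    }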

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

蓝咒 submitted on 2019-12-01 04:19:52
Question: Summary. Consider the following loop:

    loop:
        movl $0x1, (%rax)
        add  $0x40, %rax
        cmp  %rdx, %rax
        jne  loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be only the case when I count kernel-mode events, even though the program
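
A C++ analogue of that assembly loop, touching one 64-byte cache line per iteration, might look like the sketch below (the buffer size is illustrative; it just needs to exceed L3):

    #include <cstddef>
    #include <vector>

    int main() {
        // Buffer assumed larger than L3 so stores keep missing the caches.
        constexpr std::size_t kBufBytes = 256 * 1024 * 1024;
        constexpr std::size_t kLine = 64;  // cache-line size on x86_64
        std::vector<char> buf(kBufBytes);

        // One store per cache line: each store to a line not present in
        // L1D should trigger an RFO (read-for-ownership) request to L2.
        for (std::size_t i = 0; i < kBufBytes; i += kLine)
            buf[i] = 1;
    }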

Are two consecutive CPU stores on x86 flushed to the cache keeping the order?

萝らか妹 submitted on 2019-11-30 20:27:04
Assume there are two threads running on x86 CPUs CPU0 and CPU1 respectively. The thread running on CPU0 executes the following stores: A=1; B=1. The cache line containing A is initially owned by CPU1, and the one containing B by CPU0. I have two questions: If I understand correctly, both stores will be put into the CPU's store buffer. However, for the first store, A=1, the line in CPU1's cache must be invalidated, while the second store, B=1, can be flushed immediately since CPU0 owns the cache line containing it. I know that x86 CPUs respect store order. Does that mean that B=1 will not be written to the cache before A=1
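
A sketch of the corresponding two-thread litmus test in C++ (relaxed atomics are used here only to mimic plain x86 stores and loads; variable names are from the question). On x86's store ordering, a thread that observes B == 1 must also observe A == 1:

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> A{0}, B{0};

    void cpu0() {
        // Plain x86 stores enter the store buffer in program order and
        // drain to the cache in that same order.
        A.store(1, std::memory_order_relaxed);
        B.store(1, std::memory_order_relaxed);
    }

    void cpu1() {
        // On x86 (TSO), seeing B == 1 implies the store to A is also
        // visible. (The C++ memory model alone does not promise this for
        // relaxed atomics; this sketch is about the hardware ordering.)
        if (B.load(std::memory_order_relaxed) == 1)
            assert(A.load(std::memory_order_relaxed) == 1);
    }

    int main() {
        std::thread t0(cpu0), t1(cpu1);
        t0.join();
        t1.join();
    }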

pthread_create(3) and memory synchronization guarantee in SMP architectures

谁说我不能喝 submitted on 2019-11-30 20:15:44
Question: I am looking at section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), which spells out the memory synchronization rules. This is the most specific treatment of the POSIX/C memory model I have managed to find in the POSIX standard. Here's a quote from 4.11 Memory Synchronization: Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread
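
As an illustration of the rule that section describes: pthread_create() is one of the functions POSIX lists as synchronizing memory, so writes made by the creating thread before the call are visible to the new thread without any extra locking. A minimal sketch:

    #include <pthread.h>
    #include <cstdio>

    int shared = 0;  // plain, non-atomic variable

    void* worker(void*) {
        // Visible here because pthread_create() synchronizes memory:
        // everything the creating thread wrote before the call is seen
        // by the new thread without an additional lock or fence.
        std::printf("shared = %d\n", shared);
        return nullptr;
    }

    int main() {
        shared = 42;  // written before pthread_create()
        pthread_t t;
        pthread_create(&t, nullptr, worker, nullptr);
        pthread_join(t, nullptr);
    }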

What is the best NHibernate L2 cache provider?

廉价感情. submitted on 2019-11-30 15:02:25
I've seen there are plenty of them: NCache, Velocity, and so forth, but I haven't found a table comparing them. What's the best considering the following criteria: easy to understand; is being actively maintained; is free or has a good enough free version; works. I can't speak for what's best or worst, but I'll throw in my experience with NCache in case it helps. Disclaimer: NHibernate and I had some disagreements, we have since gone our separate ways :) The Good: The performance was great. The support was great, and it's well maintained (I'm speaking to its status as of ~6 months ago). It has a free

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

时光毁灭记忆、已成空白 submitted on 2019-11-30 12:22:00
Every modern high-performance CPU of the x86/x86_64 architecture has a hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data loaded from or stored to main RAM is cached in some of them. Sometimes the programmer may want some data not to be cached in some or all cache levels (for example, when wanting to memset 16 GB of RAM while keeping other data in the cache): there are non-temporal (NT) instructions for this like MOVNTDQA ( https://stackoverflow.com/a/37092 http://lwn.net/Articles/255364/ ). But is there a programmatic way (for some AMD or Intel CPU families
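
Bypassing the caches for a large memset-like fill (as opposed to turning the caches off entirely) can be done with the non-temporal store intrinsics. A minimal sketch using the SSE2 streaming store, with illustrative alignment/size assumptions stated in the comments:

    #include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_setzero_si128, _mm_sfence
    #include <cstddef>

    // Zero a buffer with non-temporal (streaming) stores so the written
    // data does not displace useful lines in the cache hierarchy.
    // Assumes dst is 16-byte aligned and len is a multiple of 16.
    void memset_zero_nt(void* dst, std::size_t len) {
        const __m128i zero = _mm_setzero_si128();
        auto* p = static_cast<__m128i*>(dst);
        for (std::size_t i = 0; i < len / 16; ++i)
            _mm_stream_si128(p + i, zero);
        _mm_sfence();  // order the streaming stores before later stores
    }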

What use is the INVD instruction?

爱⌒轻易说出口 submitted on 2019-11-30 11:02:13
The x86 INVD instruction invalidates the cache hierarchy without writing the contents back to memory, apparently. I'm curious: what use is such an instruction? Given how little control one has over what data may be in the various cache levels, and even less control over what may have already been flushed asynchronously, it seems to be little more than a way to make sure you no longer know what data is held in memory.

Excellent question! One use case for such a blunt-acting instruction as INVD is in specialized or very-early-bootstrap code, such as when the presence or absence of RAM has not
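
For completeness, with a GCC/Clang-style toolchain the instruction itself is just a one-line inline-assembly wrapper (a sketch, not something usable from ordinary user code):

    // Privileged (CPL 0) only; discards dirty data in all cache levels
    // without writing it back. Realistically usable only from firmware /
    // very early boot code, e.g. cache-as-RAM setup before DRAM init.
    // WBINVD is the write-back-then-invalidate variant.
    static inline void invd_all_caches() {
        asm volatile("invd" ::: "memory");
    }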

Why isn't there a data bus which is as wide as the cache line size?

半城伤御伤魂 submitted on 2019-11-30 09:17:02
When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy (typically 64 bytes on x86_64). This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes). EDIT: "Data bus" here means the bus between the CPU die and the DRAM modules; its width does not necessarily correlate with the word size. Depending on the strategy, the actually requested address is fetched first, and then the rest of the cache line is fetched sequentially. It would seem much faster if there were a bus with
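
As a rough worked example (assuming a DDR3/DDR4-style interface): a 64-bit (8-byte) data bus with a burst length of 8 transfers moves 8 x 8 = 64 bytes per burst, i.e. exactly one cache line, so a line fill already arrives as a single burst rather than eight independent bus transactions.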

What is a cache hit and a cache miss? Why would context-switching cause cache miss?

别来无恙 submitted on 2019-11-29 20:07:53
From Chapter 11 (Performance and Scalability), the section named Context Switching, of the JCIP book: "When a new thread is switched in, the data it needs is unlikely to be in the local processor cache, so a context switch causes a flurry of cache misses, and thus threads run a little more slowly when they are first scheduled." Can someone explain in an easy-to-understand way the concept of a cache miss and its opposite (a cache hit)? Why would context switching cause a lot of cache misses?

Can someone explain in an easy to understand way the concept of cache miss and its probable
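
To make the hit/miss distinction concrete, here is a rough C++ sketch (buffer sizes and pass counts are illustrative): sweeping a buffer that fits in L1 runs at cache speed because almost every access is a hit, while sweeping a buffer larger than L3 is limited by main memory because almost every line must be fetched on a miss. After a context switch, the incoming thread's working set has typically been evicted by whatever ran in between, so its first accesses behave like the large-buffer case.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Sweep a buffer `passes` times and report the achieved bandwidth.
    // A buffer that fits in L1/L2 is served from the cache (hits);
    // a buffer much larger than L3 must be re-fetched from RAM (misses).
    static void sweep(std::size_t bytes, int passes, const char* label) {
        std::vector<int> buf(bytes / sizeof(int), 1);
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (int p = 0; p < passes; ++p)
            sum += std::accumulate(buf.begin(), buf.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        double gib = double(bytes) * passes / (1024.0 * 1024.0 * 1024.0);
        std::printf("%-28s %.2f GiB/s (checksum %lld)\n", label, gib / secs, sum);
    }

    int main() {
        sweep(16 * 1024, 100000, "fits in L1 (cache hits):");
        sweep(std::size_t{512} * 1024 * 1024, 3, "larger than L3 (misses):");
    }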