cpu-cache

What is locality of reference?

I am having trouble understanding locality of reference. Can anyone please help me understand what it means and what the following are:

Spatial locality of reference
Temporal locality of reference

This would not matter if your computer were filled with super-fast memory. But unfortunately that's not the case, and computer memory looks something like this 1:

+----------+
|   CPU    |   <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache|   <<-- works at 100% of CPU speed (fast)
+----------+
|L2 - Cache|   <<-- works at 25% of CPU speed (medium)
+----+-----+
     |
     |         <<-- This …
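
As a rough illustration of the two kinds of locality, here is a minimal C sketch (the array size and loop order are made-up choices, not from the question). Scanning the matrix row by row exploits spatial locality, since consecutive accesses fall on the same cache line; reusing the accumulator on every iteration is temporal locality.

#include <stdio.h>

#define N 1024

int main(void)
{
    static int data[N][N];
    long sum = 0;

    /* Spatial locality: row-major traversal touches adjacent addresses,
       so every byte of each fetched cache line gets used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += data[i][j];

    /* Swapping the loops (data[j][i]) would stride N*sizeof(int) bytes
       between accesses and waste most of every fetched line. The variable
       `sum` shows temporal locality: it is reused on every iteration. */
    printf("%ld\n", sum);
    return 0;
}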

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors? How many cycles does an L1 cache hit take? How does it compare to register access?

paulsm4: Here's a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1

To answer your question: yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)

PS: The specifics will vary, but this link has some good ballpark figures: Approximate cost to …
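
One way to get a ballpark number yourself is to time a chain of dependent loads that always hit in L1. This is a hand-rolled sketch, not from the linked article: the self-referencing pointer and the iteration count are arbitrary, __rdtsc counts reference cycles rather than core cycles, and it should be built without optimization (-O0) so the loads are not hoisted.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc */

#define ITERS 100000000L

int main(void)
{
    /* A one-element pointer chain that always hits in L1: each load's
       address depends on the previous load's result, so the load
       latencies add up serially instead of overlapping. */
    void *slot = &slot;          /* points to itself */
    void **p = (void **)&slot;

    uint64_t start = __rdtsc();
    for (long i = 0; i < ITERS; i++)
        p = (void **)*p;         /* dependent L1-hit load */
    uint64_t end = __rdtsc();

    printf("~%.2f reference cycles per L1-hit load (p=%p)\n",
           (double)(end - start) / ITERS, (void *)p);
    return 0;
}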

What's the difference between conflict miss and capacity miss

A capacity miss occurs when blocks are discarded from the cache because the cache cannot contain all the blocks needed for program execution (the program's working set is much larger than the cache capacity). A conflict miss occurs with set-associative or direct-mapped block placement strategies, when several blocks map to the same set or block frame; these are also called collision misses or interference misses. Are they actually very closely related? For example, if all the cache lines are filled and we have a read request for memory B, for which we have to evict memory A. So …
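
The distinction shows up clearly with access patterns like the hypothetical sketch below (the 32 KiB direct-mapped cache and 64-byte lines are assumptions, not from the question): striding by exactly the cache size forces every access into the same set, so it conflict-misses even though only a few hundred bytes are live, whereas capacity misses require a working set bigger than the whole cache.

#include <stdlib.h>

/* Assumed toy parameters: 32 KiB direct-mapped cache, 64-byte lines. */
#define CACHE_SIZE (32 * 1024)
#define NBLOCKS 8

int main(void)
{
    char *buf = calloc(NBLOCKS, CACHE_SIZE);
    volatile char sink = 0;

    /* Conflict misses: these 8 addresses differ by multiples of the cache
       size, so in a direct-mapped cache they all map to the same line.
       Only 8 lines of data are touched, yet each access can evict the
       previous one. A capacity miss would instead need the working set to
       exceed the full 32 KiB, e.g. streaming over all 256 KiB of buf. */
    for (int round = 0; round < 1000; round++)
        for (int b = 0; b < NBLOCKS; b++)
            sink += buf[(size_t)b * CACHE_SIZE];

    free(buf);
    return (int)sink;
}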

Understanding CPU cache and cache line

I am trying to understand how a CPU cache operates. Let's say we have this configuration (as an example):

Cache size 1024 bytes
Cache line 32 bytes
1024/32 = 32 cache lines altogether.
A single cache line can store 32/4 = 8 ints.

1) According to this configuration, the length of the tag should be 32-5=27 bits, and the size of the index 5 bits (2^5 = 32 addresses for each byte in a cache line). If the total cache size is 1024 and there are 32 cache lines, where are the tags+indexes stored? (That is another 4*32 = 128 bytes.) Does it mean that the actual size of the cache is 1024+128 = 1152? 2) If the cache line is 32 …
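
A small sketch of how a 32-bit address would be split under this configuration, assuming a direct-mapped cache (the question doesn't state the associativity). Note that once 5 index bits are carved out on top of the 5 offset bits, the tag is 32-5-5 = 22 bits, not 27; the index is never stored, since it is implied by a line's position, and the tags live in a separate tag array alongside the data, which is why the quoted "cache size" counts only the 1024 data bytes.

#include <stdint.h>
#include <stdio.h>

/* Configuration from the question: 1024-byte cache, 32-byte lines,
   32 lines, direct-mapped (assumed).
   offset = 5 bits, index = 5 bits, tag = 32 - 5 - 5 = 22 bits. */
#define OFFSET_BITS 5
#define INDEX_BITS  5

int main(void)
{
    uint32_t addr = 0x1234ABCDu;   /* an arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%06x index=%u offset=%u\n", tag, index, offset);
    return 0;
}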

What is a cache hit and a cache miss? Why would context-switching cause cache miss?

From the 11th chapter (Performance and Scalability) of the JCIP book, in the section named Context Switching:

When a new thread is switched in, the data it needs is unlikely to be in the local processor cache, so a context switch causes a flurry of cache misses, and thus threads run a little more slowly when they are first scheduled.

Can someone explain, in an easy-to-understand way, the concept of a cache miss and its probable opposite (a cache hit)? Why would context switching cause a lot of …
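
The hit/miss difference is easy to observe from user space. A rough sketch (the array size and timing method are arbitrary choices): the first pass walks a cold array and mostly misses, the second pass finds the same data cached and runs noticeably faster; a context switch has a similar effect, because the incoming thread's data has meanwhile been evicted by whoever ran in between.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 18)   /* 1 MiB of ints; assumed to fit in the cache hierarchy */

static double pass_ns(volatile int *a)
{
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    volatile int *a = calloc(N, sizeof(int));
    double cold = pass_ns(a);   /* mostly cache misses */
    double warm = pass_ns(a);   /* mostly cache hits   */
    printf("cold pass: %.0f ns, warm pass: %.0f ns\n", cold, warm);
    free((void *)a);
    return 0;
}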

Using time stamp counter and clock_gettime for cache miss

As a follow-up to this topic, in order to calculate the memory miss latency, I have written the following code using _mm_clflush, __rdtsc and _mm_lfence (which is based on the code from this question/answer). As you can see in the code, I first load the array into the cache. Then I flush one element, and therefore the cache line is evicted from all cache levels. I put _mm_lfence in order to preserve the order during -O3. Next, I use the time stamp counter to calculate the latency of reading array[0]. As you can see, between the two time stamps there are three instructions: two lfence and one read. …
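
The asker's exact code isn't shown here, but the pattern described reconstructs to something like the following sketch (the array size and printout are assumptions; compile with -O3 as the question mentions):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_lfence, __rdtsc */

/* volatile so -O3 cannot fold loads of a zero-initialized array away */
static volatile int array[4096];

int main(void)
{
    long sum = 0;

    /* 1) Warm the array so the line of interest is resident in the cache. */
    for (int i = 0; i < 4096; i++)
        sum += array[i];

    /* 2) Evict array[0]'s line from all cache levels. */
    _mm_clflush((const void *)&array[0]);
    _mm_lfence();

    /* 3) Time the now-cold load. Between the two time stamps there are
       exactly three instructions: lfence, the read, lfence. */
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    sum += array[0];
    _mm_lfence();
    uint64_t t1 = __rdtsc();

    printf("miss latency ~ %llu reference cycles (sum=%ld)\n",
           (unsigned long long)(t1 - t0), sum);
    return 0;
}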

WBINVD instruction usage

I'm trying to use the WBINVD instruction on Linux to clear the processor's L1 cache. The following program compiles, but produces a segmentation fault when I try to run it:

int main() { asm ("wbinvd"); return 1; }

I'm using gcc 4.4.3 and run Linux kernel 2.6.32-33 on my x86 box. Processor info: Intel(R) Core(TM)2 Duo CPU T5270 @ 1.40GHz. I built and ran the program as follows:

$ gcc
$ ./a.out
Segmentation Fault

Can somebody tell me what I'm doing wrong? How do I get this to run? P.S.: I'm running a few performance tests and want to ensure that the previous contents of the processor cache do not influence …
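
The fault itself is explainable: WBINVD is a privileged (ring 0) instruction, so executing it in a user-mode process raises a general-protection fault, which Linux delivers to the process as SIGSEGV. It therefore has to be issued from kernel mode; below is a minimal, untested sketch of a kernel module doing so (on an SMP system this flushes only the CPU the init function happens to run on):

#include <linux/module.h>
#include <linux/init.h>

static int __init flush_init(void)
{
    /* wbinvd requires CPL 0, which is why the user-space version faults. */
    asm volatile("wbinvd" ::: "memory");
    pr_info("wbinvd executed\n");
    return 0;
}

static void __exit flush_exit(void) { }

module_init(flush_init);
module_exit(flush_exit);
MODULE_LICENSE("GPL");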

Benchmarking affected by VCL

Today I ported my old memory benchmark from Borland C++ Builder 5.0 to BDS2006 Turbo C++ and found a weird thing. The exe from BCB5 runs OK and stable. The exe from BDS2006 measures OK only before the main Form is started (inside its constructor); if the benchmark is started again after the main form is activated, or even after any VCL component change (for example the Caption of the main form), then the speed of the benchmark thread is strongly affected. After some research I found out that: it does not matter if the test is …

Is stack memory contiguous physically in Linux?

As far as I can see, stack memory is contiguous in the virtual address space, but is stack memory also contiguous physically? And does this have something to do with the stack size limit?

Edit: I used to believe that stack memory doesn't have to be physically contiguous, but then why do we think that stack memory is always quicker than heap memory? If it's not physically contiguous, how can the stack take more advantage of the cache? And there is another thing that always confuses me: the CPU executes directives …
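
The virtual-address side is easy to observe from user space with a small sketch like the one below (the recursion depth is arbitrary): the locals of successive frames sit at tightly packed, descending virtual addresses, even though each page behind them may reside anywhere in physical RAM. Caches work on lines addressed page by page through the same translation machinery as everything else, so the stack's speed advantage comes mainly from reuse, the few hot lines near the stack top staying cached, rather than from physical contiguity.

#include <stdio.h>

/* Print a local variable's address at several recursion depths: the
   addresses descend contiguously in virtual memory, while the physical
   placement of the underlying pages is invisible (and irrelevant) here. */
static void probe(int depth)
{
    int local = depth;
    printf("depth %d: &local = %p\n", local, (void *)&local);
    if (depth < 4)
        probe(depth + 1);
}

int main(void)
{
    probe(0);
    return 0;
}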

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept and couldn't find detailed enough answers that throw some light upon how everything actually works out in the hardware. Please provide any relevant details.

In the case of VIPT caches, the memory request is sent in parallel to both the TLB and the cache. From the TLB we get the translated physical address. From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set). Then the translated TLB address is matched against the list of …
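
A worked example of why the parallel lookup is possible, under an assumed but typical L1 geometry (32 KiB, 8-way, 64-byte lines, 4 KiB pages; none of these numbers come from the question): index and offset together occupy address bits 11:0, which are page-offset bits that translation never changes, so the set can be selected before the TLB returns the physical page number used for the tag compare.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets.
   offset = bits 5:0, index = bits 11:6, both inside the 4 KiB page
   offset (bits 11:0), so set selection needs no translation. */
#define LINE 64
#define WAYS 8
#define SIZE (32 * 1024)
#define SETS (SIZE / (LINE * WAYS))

int main(void)
{
    uint64_t vaddr = 0x00007f12345678abULL;     /* example virtual address */
    uint64_t set = (vaddr >> 6) & (SETS - 1);   /* picked from untranslated bits */
    printf("set %llu selected while the TLB translates bits 63:12\n",
           (unsigned long long)set);
    return 0;
}

Once the TLB answers, the physical tag is compared against the tags of all 8 ways in that set, exactly as the question describes.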