cpu-cache

Write Allocate / Fetch on Write Cache Policy

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-08 03:22:24
Question: I couldn't find a source that explains how the policy works in detail. The combinations of write policies are explained in Jouppi's paper, for those interested. This is how I understood it:
1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for the request (write-allocate).
4. The block is fetched from lower memory into the allocated cache block (fetch-on-write).
Now we are able to write onto the allocated and fetch-updated cache block.
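To make the sequence concrete, here is a minimal toy sketch of a direct-mapped write-allocate, fetch-on-write cache in C. The geometry, names, and backing-store array are all illustrative, not any particular hardware's behavior:

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64
    #define NUM_SETS  64

    struct line { uint64_t tag; int valid; uint8_t data[LINE_SIZE]; };
    static struct line cache[NUM_SETS];
    static uint8_t memory[1 << 20];                    /* toy backing store */

    void write_byte(uint64_t addr, uint8_t val) {
        uint64_t set = (addr / LINE_SIZE) % NUM_SETS;
        uint64_t tag = addr / (LINE_SIZE * NUM_SETS);
        struct line *l = &cache[set];
        if (!l->valid || l->tag != tag) {              /* write miss */
            l->valid = 1;                              /* 1. allocate a line (write-allocate) */
            l->tag = tag;
            memcpy(l->data, &memory[addr & ~(uint64_t)(LINE_SIZE - 1)], LINE_SIZE);
                                                       /* 2. fill it from memory (fetch-on-write) */
        }
        l->data[addr % LINE_SIZE] = val;               /* 3. merge the store into the line */
    }

The point of fetch-on-write is step 2: the rest of the line must be valid before a partial write can be merged into it.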

Why do L1 and L2 Cache waste space saving the same data?

Submitted by 蓝咒 on 2019-12-07 16:33:02
Question: I don't know why the L1 cache and the L2 cache store the same data. For example, say we want to access Memory[x] for the first time. Memory[x] is mapped into the L2 cache first, then the same piece of data is mapped into the L1 cache, where a CPU register can retrieve it. But now we have duplicate data stored in both L1 and L2; isn't that a problem, or at least a waste of storage space? Answer 1: I edited your question to ask why CPUs waste cache space storing the same data in multiple levels of cache …
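One way to see the cost of that duplication is a working-set sweep: the latency of a dependent load steps up as the working set outgrows each cache level, and with an inclusive L2 the second step appears at the L2 capacity rather than at L1+L2. A rough POSIX sketch; sizes and iteration counts are arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t kb = 16; kb <= 8192; kb *= 2) {
            size_t n = kb * 1024 / sizeof(size_t);
            size_t *a = malloc(n * sizeof *a);
            size_t *p = malloc(n * sizeof *p);
            for (size_t i = 0; i < n; i++) p[i] = i;
            for (size_t i = n - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = p[i]; p[i] = p[j]; p[j] = t;
            }
            for (size_t i = 0; i < n; i++)             /* random cyclic pointer chain */
                a[p[i]] = p[(i + 1) % n];
            size_t idx = 0;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int r = 0; r < 10000000; r++) idx = a[idx];   /* dependent loads */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("%7zu KB  %.2f ns/load  (%zu)\n", kb, ns / 1e7, idx);
            free(a); free(p);
        }
        return 0;
    }

The random chain defeats the hardware prefetcher, so the printed ns/load roughly tracks the latency of whichever level the working set fits in.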

Sandy-Bridge CPU specification

Submitted by 人走茶凉 on 2019-12-07 11:02:10
Question: I was able to piece together bits here and there about the Sandy Bridge-E architecture, but I am not totally sure about all the parameters, e.g. the size of the L2 cache. Can anyone please confirm they are all correct? My main source was 64-ia-32-architectures-optimization-manual.pdf. Answer 1: On Sandy Bridge, each core has 256 KB of L2 (see the datasheet, section 1.1). For 6 cores that's 1.5 MB in total, but since each core only accesses its own, it's better to always look at it as 256 KB per core. Moreover …
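Rather than piecing specs together from documents, the cache geometry can be read directly from CPUID leaf 4 on Intel parts. A sketch using GCC/Clang's <cpuid.h> helper; per the SDM, size = ways × partitions × line size × sets:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        for (unsigned sub = 0; ; sub++) {
            unsigned a, b, c, d;
            if (!__get_cpuid_count(4, sub, &a, &b, &c, &d)) break;
            unsigned type = a & 0x1f;
            if (type == 0) break;                      /* no more cache levels */
            unsigned level = (a >> 5) & 0x7;
            unsigned ways  = ((b >> 22) & 0x3ff) + 1;
            unsigned parts = ((b >> 12) & 0x3ff) + 1;
            unsigned line  = (b & 0xfff) + 1;
            unsigned sets  = c + 1;
            printf("L%u %-7s %5u KB, %2u-way, %u-byte lines\n",
                   level,
                   type == 1 ? "data" : type == 2 ? "inst" : "unified",
                   ways * parts * line * sets / 1024, ways, line);
        }
        return 0;
    }

On a Sandy Bridge core this should report a 256 KB, 8-way unified L2, confirming the per-core figure above.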

According to Intel my cache should be 24-way associative though it's 12-way, how is that?

Submitted by 我们两清 on 2019-12-07 09:11:44
Question: According to the “Intel 64 and IA-32 Architectures Optimization Reference Manual,” April 2012, page 2-23: “The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5 MB/1 MB/1.5 MB/2 MB block size.” However, due to the address distribution among the cache blocks, from the software point of view this does not appear as a normal N-way set-associative cache …
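The quoted 4/8/12/16-way figures are per slice, and they fall out of a fixed line size and set count. A quick arithmetic check in C, assuming 64-byte lines and 2048 sets per slice (the set count is inferred from the quoted way counts, not taken from the manual):

    #include <stdio.h>

    int main(void) {
        const unsigned line_bytes = 64, sets = 2048;   /* 2048 sets * 64 B = 128 KB per way */
        const unsigned slice_kb[] = { 512, 1024, 1536, 2048 };
        for (int i = 0; i < 4; i++) {
            unsigned ways = slice_kb[i] * 1024 / (sets * line_bytes);
            printf("%4u KB slice -> %2u ways\n", slice_kb[i], ways);
        }
        return 0;
    }

This prints 4, 8, 12, and 16 ways, matching the manual. It also suggests how a cache advertised as 24-way can behave as 12-way: the hash picks a slice first, so any single address only ever competes for the ways of its own slice.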

Optimizing Cortex-A8 color conversion using NEON

Submitted by 社会主义新天地 on 2019-12-07 05:47:54
Question: I am currently writing a color conversion routine to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses.

void convert_hd(uint8_t *orig, uint8_t *result) {
    uint32_t width = 1280;
    uint32_t height = 720;
    uint8_t *lineOdd = orig;
    uint8_t *lineEven = orig + width*2;
    uint8_t *resultYOdd = result;
    uint8_t *resultYEven = result + width;
    uint8_t *resultUV = result + height*width;
    uint32_t totalLoop = height/2;
    while …
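The excerpt cuts off at the loop, so for reference here is one plausible scalar completion, not the asker's original code. It assumes chroma is taken from the odd source line (vertical 2:1 subsampling by dropping; averaging the two lines would be the other common choice):

    /* YUY2 packs [Y0 U Y1 V] per pixel pair; NV12 wants a full Y plane
       followed by interleaved UV rows at half vertical resolution. */
    while (totalLoop--) {
        for (uint32_t x = 0; x < width / 2; x++) {
            resultYOdd[2*x]      = lineOdd[4*x];        /* Y0 */
            resultYOdd[2*x + 1]  = lineOdd[4*x + 2];    /* Y1 */
            resultYEven[2*x]     = lineEven[4*x];
            resultYEven[2*x + 1] = lineEven[4*x + 2];
            resultUV[2*x]        = lineOdd[4*x + 1];    /* U */
            resultUV[2*x + 1]    = lineOdd[4*x + 3];    /* V */
        }
        lineOdd     += width * 4;   /* advance two source rows (stride = width*2 bytes) */
        lineEven    += width * 4;
        resultYOdd  += width * 2;   /* advance two Y rows */
        resultYEven += width * 2;
        resultUV    += width;       /* one UV row per two Y rows */
    }
}

A NEON version of this would typically load 16 pixel pairs at a time with vld4q_u8 to deinterleave the Y bytes from the U and V bytes directly in registers.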

Virtually indexed physically tagged cache Synonym

Submitted by 别说谁变了你拦得住时间么 on 2019-12-07 05:27:00
Question: I am not able to entirely grasp the concept of synonyms, or aliasing, in VIPT caches. Consider the address split (the figure is not reproduced in this excerpt). Suppose we have two pages with different VAs mapped to the same physical address (or frame number). The page-number part of the VA (bits 13-39), which differs between the two, is translated to the PFN of the PA (bits 12-35), and the PFN is the same for both VAs since they map to the same physical frame. Now the page-offset part (bits 0-13) of both VAs is the same, as is the data they want to access …
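The synonym question ultimately reduces to whether the cache index bits fit inside the page offset: if they do, both virtual addresses index the same set and the physical tag resolves the rest, so no aliasing is possible. A tiny check with illustrative numbers (a 32 KB, 8-way L1 and 4 KB pages, the classic x86 case):

    #include <stdio.h>

    int main(void) {
        unsigned cache_bytes = 32 * 1024, ways = 8, page_bytes = 4096;
        unsigned index_span = cache_bytes / ways;  /* bytes covered by index + offset bits */
        printf("index+offset span %u B vs page %u B -> aliasing %s\n",
               index_span, page_bytes,
               index_span > page_bytes ? "possible" : "impossible");
        return 0;
    }

Here 32 KB / 8 ways = 4 KB, exactly one page, so the index never uses translated bits; grow the cache or shrink the associativity and synonyms become possible.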

What is the cache line size on iPhone and iPad?

Submitted by 牧云@^-^@ on 2019-12-07 03:39:24
What is the cache line size on iPhone and iPad? And does it vary much between the different devices and CPUs? This is not too easy to find with Google. I need to squeeze some extra performance from my app. :) Well, the Cortex-A8 has 64-byte lines, the Cortex-A9 has 32-byte lines; as for Swift and Cyclone I don't know, but looking at comparable cores (A15, A57, Scorpion, Krait), 32 or 64 bytes seems likely. Either way there are at least two different line lengths across iOS 7 machines. As you're performance-focused, though, remember that benchmarking is infinitely more valuable than theorizing: try as many devices as you can …
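On Apple platforms the line size doesn't have to be guessed from the core, since Darwin exposes it as a sysctl. A minimal query; the same call works on both iOS and macOS:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/sysctl.h>

    int main(void) {
        int64_t line = 0;
        size_t len = sizeof line;
        if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0)
            printf("cache line size: %lld bytes\n", (long long)line);
        else
            perror("sysctlbyname");
        return 0;
    }

Querying at runtime also future-proofs the app against new devices with a different line length.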

Is there a cheaper serializing instruction than cpuid?

Submitted by 本小妞迷上赌 on 2019-12-06 13:40:55
Question: I have seen the related questions, including here and here, but it seems that the only instruction ever mentioned for serializing rdtsc is cpuid. Unfortunately, cpuid takes roughly 1000 cycles on my system, so I am wondering if anyone knows of a cheaper (fewer cycles, and no reads or writes to memory) serializing instruction? I looked at iret, but that seems to change control flow, which is also undesirable. I have actually looked at the whitepaper linked in Alex's answer about rdtscp, but it …
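The usual cheaper alternative is an lfence-fenced rdtsc rather than full serialization with cpuid. A sketch using GCC/Clang inline assembly; note the AMD caveat in the comment:

    #include <stdint.h>

    /* On Intel CPUs, lfence does not dispatch later instructions until all
       prior instructions have completed locally, so it orders rdtsc without
       cpuid's cost or any memory traffic. On AMD, lfence is only dispatch-
       serializing when the corresponding MSR bit is set (the default on
       newer cores). rdtscp similarly waits for prior instructions. */
    static inline uint64_t rdtsc_fenced(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("lfence\n\trdtsc"
                             : "=a"(lo), "=d"(hi)
                             :
                             : "memory");
        return ((uint64_t)hi << 32) | lo;
    }

This is not architecturally "serializing" in the cpuid sense (it does not flush the pipeline or drain the store buffer), but for fencing rdtsc against earlier instructions it is typically sufficient and far cheaper.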

What is the improvement in the ARM11 cache?

Submitted by 一个人想着一个人 on 2019-12-06 13:25:14
Question: It's said that in ARM11 the cache is physically addressed, solving many cache aliasing problems and reducing context-switch overhead. How should "physically addressed" be understood? How does it help solve the cache aliasing problems and reduce the context-switch overhead? Answer 1: There are three common types of cache:
VIVT = Virtually Indexed, Virtually Tagged
VIPT = Virtually Indexed, Physically Tagged
PIPT = Physically Indexed, Physically Tagged
There is also PIVT = Physically Indexed, Virtually Tagged. PIPT …
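A synonym, the problem PIPT avoids, is easy to manufacture in user space: map the same physical page at two virtual addresses. On a physically addressed cache both mappings index and tag to the same lines, so the read below is coherent with no cache maintenance, whereas a VIVT cache could hold two stale copies. A POSIX sketch; the shm name is arbitrary and error handling is omitted:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = shm_open("/synonym_demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        /* two different virtual addresses for the same physical page */
        char *va1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *va2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(va1, "written through va1");
        printf("read through va2: %s\n", va2);   /* coherent on PIPT hardware */
        shm_unlink("/synonym_demo");
        return 0;
    }

The context-switch benefit is related: because lines are named by physical addresses, nothing in a PIPT cache becomes stale when the virtual address space changes, so no flush is needed.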

The order in which the L1 cache controller processes memory requests from the CPU

Submitted by [亡魂溺海] on 2019-12-06 13:17:15
Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer to hold pending write requests, and it can serve reordered read requests from that write buffer. It is said that write requests exit the write buffer and are issued toward the cache hierarchy in FIFO order, which is the same as program order. What I am curious about: to serve the write requests issued from the write buffer, does the L1 cache controller handle them, complete cache coherence for them, and insert the data into the L1 cache in the same order in which they were issued? I think you're …
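The classic way to watch the write buffer in action is the store-buffering litmus test: each thread stores to one flag and then loads the other, and on x86 both loads can still observe 0 because each load may execute while the other core's store is still sitting in its write buffer. A pthreads sketch; the iteration count is arbitrary, and volatile here only keeps the compiler from reordering or caching the accesses:

    #include <pthread.h>
    #include <stdio.h>

    volatile int X, Y, r1, r2;

    void *t1(void *arg) { (void)arg; X = 1; r1 = Y; return NULL; }
    void *t2(void *arg) { (void)arg; Y = 1; r2 = X; return NULL; }

    int main(void) {
        for (int i = 0; i < 200000; i++) {
            X = Y = 0;
            pthread_t a, b;
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0) {
                printf("store buffering observed at iteration %d\n", i);
                return 0;
            }
        }
        puts("no reordering observed (try more iterations)");
        return 0;
    }

Observing r1 == r2 == 0 demonstrates the store-to-load reordering TSO permits; inserting mfence (or an atomic exchange) between each store and load forbids it.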