cpu-cache

Benchmarking affected by VCL

故事扮演, submitted 2019-11-29 15:37:49
Today I ported my old memory benchmark from Borland C++ Builder 5.0 to BDS2006 Turbo C++ and found something weird. The exe from BCB5 runs OK and is stable. The exe from BDS2006 measures correctly only before the main Form is started (inside its constructor); if the benchmark is started again after the main form is activated, or even after any VCL component change (for example the Caption of the main form), then the speed of the benchmark thread is strongly affected. After some research I found out that: it does not matter whether the test runs inside a thread or not; the process/thread priority and affinity do not affect this either; hiding any …

Why isn't there a data bus which is as wide as the cache line size?

不问归期, submitted 2019-11-29 14:09:44
Question: When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy (typically 64 bytes on x86_64). This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes). EDIT: "Data bus" in this context means the bus between the CPU die and the DRAM modules. This data-bus width does not necessarily correlate with the word size. Depending on the strategy, the actually requested address gets fetched first, and then …

How can I share a library between two programs in C?

↘锁芯ラ, submitted 2019-11-29 12:31:14
I want to use the same library functions (i.e., the OpenSSL library) in two different C programs for computation. How can I make sure that both programs use a common library, meaning only one copy of the library is loaded into shared main memory and both programs access the library from that memory location? For example, when the 1st program accesses the library for computation, it is loaded into the cache from main memory, and when the 2nd program wants to access it later, it will access the data from the cache (already loaded by the 1st program), not from main memory again. I am using GCC under Linux. Any …

Implementing a cache modeling framework

旧时模样, submitted 2019-11-29 12:02:18
I would like to model the behavior of caches in Intel architectures (LRU, inclusive, K-way associative, etc.). I've read Wikipedia, Ulrich Drepper's great paper on memory, and the Intel Manual Volume 3A: System Programming Guide (chapter 11, but it's not very helpful, because it only explains what can be manipulated at the software level). I've also read a bunch of academic papers, but as usual, they do not make their code available for replication, even after asking for it. My question is: is there already a publicly available framework to model cache behavior? If not, is there a document …

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

时光总嘲笑我的痴心妄想, submitted 2019-11-29 12:01:10
Question: Every modern high-performance CPU of the x86/x86_64 architecture has a hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data loaded from/to main RAM is cached in some of them. Sometimes the programmer may want some data not to be cached in some or all cache levels (for example, when wanting to memset 16 GB of RAM while keeping other data in the cache): there are non-temporal (NT) instructions for this, like MOVNTDQA (https://stackoverflow.com/a/37092 …

Where is the Write-Combining Buffer located? x86

巧了我就是萌, submitted 2019-11-29 10:42:20
How is the Write-Combining buffer physically hooked up? I have seen block diagrams illustrating a number of variants: between L1 and the memory controller; between the CPU's store buffer and the memory controller; between the CPU's AGUs and/or store units. Is it microarchitecture-dependent? Write buffers can have different purposes or different uses in different processors, so this answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different things in different contexts. This answer is about Intel and AMD processors only. Write-combining buffers …

Does an x86_64 CPU use the same cache lines to communicate between 2 processes via shared memory?

不羁岁月, submitted 2019-11-29 08:46:05
As is known, all cache levels L1/L2/L3 on modern x86_64 CPUs are virtually indexed, physically tagged, and all cores communicate via the last-level cache (L3) using a cache-coherence protocol (MOESI/MESIF) over QPI/HyperTransport. For example, a Sandy Bridge family CPU has a 4-16-way L3 cache and a 4 KB page size; this allows data to be exchanged between concurrent processes executing on different cores via shared memory. This is possible because the L3 cache can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time. Does this mean that …

Is there a way to check whether the processor cache has been flushed recently?

[亡魂溺海], submitted 2019-11-29 08:28:17
Question: On i386 Linux. Preferably in C (C/POSIX standard libraries) or /proc if possible. If not, is there any piece of assembly or a third-party library that can do this? Edit: I'm trying to develop a test of whether a kernel module clears a cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible. Answer 1: Cache-coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting …

Loop tiling: how to choose the block size?

旧街凉风, submitted 2019-11-29 03:13:10
Question: I am trying to learn loop optimization. I found that loop tiling helps make array loops faster. I tried the two blocks of code given below, with and without loop blocking, and measured the time taken for both. I did not find a significant difference most of the time. I tested varying the block size, but I am not sure how to choose it. Please help me if my direction is wrong. In fact, I found that the loop without blocking works faster many times. a. With blocking: int max = …

Cache bandwidth per tick for modern CPUs

馋奶兔, submitted 2019-11-28 23:25:22
What is the speed of cache access for modern CPUs? How many bytes can be read or written from memory every processor clock tick by an Intel P4, Core 2, Core i7, or AMD CPU? Please answer with both theoretical numbers (width of the load/store units with their throughput in uops/tick) and practical numbers (even memcpy speed tests, or the STREAM benchmark), if any. PS: this question relates to the maximal rate of load/store instructions in assembly. There can be a theoretical loading rate (all instructions per tick being the widest loads), but the processor can deliver only part of that, which is the practical limit on loading. osgx: For Nehalem: rolfed …