cpu-cache

How can I share a library between two programs in C?

此生再无相见时 submitted on 2019-11-28 06:11:49
Question: I want to use the same library functions (e.g., from the OpenSSL library) in two different C programs for computation. How can I make sure that both programs use a common library, i.e., that only one copy of the library is loaded into shared main memory and both programs access the library from that memory location? For example, when the first program accesses the library for computation, it is loaded into the cache from main memory, and when the second program wants to access it later, it will access the data …

Implementing a cache modeling framework

岁酱吖の submitted on 2019-11-28 05:57:42
Question: I would like to model the behavior of caches in Intel architectures (LRU, inclusive, K-way associative, etc.). I've read Wikipedia, Ulrich Drepper's great paper on memory, and the Intel Manual Volume 3A: System Programming Guide (chapter 11, but it's not very helpful, because it only explains what can be manipulated at the software level). I've also read a bunch of academic papers, but as usual, they do not make their code available for replication... even after asking for it. My question is …

CUDA disable L1 cache only for one variable

允我心安 submitted on 2019-11-28 05:06:49
Is there any way on CUDA compute capability 2.0 devices to disable the L1 cache for only one specific variable? I know that one can disable the L1 cache at compile time by adding the flag -Xptxas -dlcm=cg to nvcc, but that applies to all memory operations. I want to disable the cache only for memory reads of one specific global variable, so that all other memory reads still go through the L1 cache. Based on a search I have done on the web, a possible solution is through PTX assembly code.

Answer (Reguj): As mentioned above you can use inline PTX, here is an example:

    __device__ __inline__ double ld_gbl_cg(const double *addr) {
        double return_value;
        asm("ld.global.cg.f64 %0, [%1];" : "=d"(return_value) : "l"(addr));
        return return_value;
    }

Where is the Write-Combining Buffer located? x86

喜你入骨 submitted on 2019-11-28 03:58:02
Question: How is the Write-Combining buffer physically hooked up? I have seen block diagrams illustrating a number of variants: between the L1 and the memory controller; between the CPU's store buffer and the memory controller; between the CPU's AGUs and/or store units. Is it microarchitecture-dependent?

Answer 1: Write buffers can have different purposes or different uses in different processors. This answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different …

How are cache memories shared in multicore Intel CPUs?

半世苍凉 submitted on 2019-11-28 03:55:23
I have a few questions regarding the cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, this has many repercussions when one writes software for multicore processors/multiprocessor systems, hence asking here!) In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.), does each CPU core/processor have its own cache memory (data and program cache)? Can one processor/core access another's cache memory? Because if they are allowed to access each other's caches, then I believe there might be less cache …

C++ cache aware programming

不羁岁月 submitted on 2019-11-28 03:01:50
Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data and I'd like to break this data down into chunks such that they fit into the cache. Is this possible? Can you give me any other hints on programming with the cache size in mind (especially in regard to multithreaded/multicore data processing)? Thanks! According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux: Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be …

How is x86 instruction cache synchronized?

心不动则不痛 submitted on 2019-11-27 19:06:01
I like examples, so I wrote a bit of self-modifying code in C...

    #include <stdio.h>
    #include <sys/mman.h> // linux

    int main(void) {
        unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
        c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
        c[1] = 0b11000000; // to register rax (000) which holds the return value
                           // according to the linux x86_64 calling convention
        c[6] = 0b11000011; // return
        for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
            // rest of immediate data (c[3:6]) are zero, since anonymous mmap
            // pages are zero-filled
            printf("%d\n", ((int (*)(void))c)()); // run the buffer as a function
        }
    }

Does a memory barrier ensure that the cache coherence has been completed?

孤人 submitted on 2019-11-27 18:51:52
Say I have two threads that manipulate a global variable x. Each thread (or each core, I suppose) will have a cached copy of x. Now say that Thread A executes the following instructions: "set x to 5", then "some other instruction". When "set x to 5" is executed, the cached value of x will be set to 5; this causes the cache coherence protocol to act and update the other cores' caches with the new value of x. Now my question is: when x is actually set to 5 in Thread A's cache, do the caches of the other cores get updated before "some other instruction" is executed? Or should a memory …

Why does the speed of memcpy() drop dramatically every 4KB?

孤人 submitted on 2019-11-27 17:03:17
I tested the speed of memcpy(), noticing that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the buffer size for memcpy(), increasing from 1KB to 2MB. Subfigures 2 and 3 detail the ranges 1KB-150KB and 1KB-32KB. Environment: CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz; OS: 2.6.35-22-generic #33-Ubuntu; GCC compiler flags: -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99. I guess it must be related to caches, but I can't find a reason in the following cache-unfriendly cases: Why is my program slow when looping over exactly 8192 …

Are there any modern CPUs where a cached byte store is actually slower than a word store?

依然范特西╮ submitted on 2019-11-27 16:10:12
It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache line, too. Are some microcontrollers or low-end CPUs different, if they have a cache at all? (I'm not counting word-addressable machines, or Alpha, which is byte-addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)