cpu-cache

CUDA disable L1 cache only for one variable

Submitted by 久未见 on 2019-12-17 17:46:21
Question: Is there any way on CUDA 2.0 devices to disable the L1 cache for only one specific variable? I know that the L1 cache can be disabled at compile time for all memory operations by passing the flag -Xptxas -dlcm=cg to nvcc. However, I want to disable caching only for reads of one specific global variable, so that all other memory reads still go through the L1 cache. Based on a search of the web, a possible solution is PTX assembly code. Answer 1: As mentioned above you can use
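
One way to do this in practice is sketched below, under the assumption that it is compiled as device code with nvcc; the helper name load_bypass_l1 is made up for illustration. The ld.global.cg cache operator caches the load at the L2 level only, so all other loads in the kernel keep going through L1:

    __device__ __forceinline__ float load_bypass_l1(const float *p) {
        float v;
        // .cg = cache global: cache in L2, bypass L1 on compute 2.x parts
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

Newer toolkits also expose this as the __ldcg() intrinsic, so the inline PTX is mainly needed on older CUDA versions.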

How many bits are in the address field for a directly mapped cache?

Submitted by 亡梦爱人 on 2019-12-17 17:28:10
Question: This is a question about direct-mapped caches, so I am assuming it's OK to ask here as well. Here is the problem I am working on: "A high-speed workstation has 64-bit words and 64-bit addresses with address resolution at the byte level. Assuming a direct-mapped cache with 8192 64-byte lines, how many bits are in each of the following address fields for the cache? 1) Byte 2) Index 3) Tag?" First I defined the terms in this problem and used the other Stack Overflow Direct
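
One way to work the numbers out, assuming the standard byte-offset / index / tag split of the 64-bit address:

    byte offset: 64-byte lines            -> log2(64)   =  6 bits
    index:       8192 lines, direct mapped -> log2(8192) = 13 bits
    tag:         64 - 13 - 6              = 45 bits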

prefetching data at L1 and L2

Submitted by 折月煮酒 on 2019-12-17 15:46:28
Question: In Agner Fog's manual Optimizing software in C++, section 9.10 "Cache contentions in large data structures", he describes a problem with transposing a matrix when the matrix width is equal to something called the critical stride. In his test, the cost for a matrix that fits in L1 is 40% greater when the width equals the critical stride. If the matrix is even larger and only fits in L2, the cost is 600%! This is summed up nicely in Table 9.1 of his text. This is essentially the same thing observed
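
For reference, the critical stride mentioned above is defined in that manual as

    critical stride = cache size / associativity = number of sets x line size

so, taking a typical 32 KB, 8-way L1 with 64-byte lines as an assumed example, the critical stride is 32768 / 8 = 4096 bytes: addresses that are a multiple of 4096 bytes apart all map to the same set, and only 8 of them can be cached at once.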

How is x86 instruction cache synchronized?

Submitted by 旧时模样 on 2019-12-17 15:42:21
Question: I like examples, so I wrote a bit of self-modifying code in C...

    #include <stdio.h>
    #include <sys/mman.h> // linux

    int main(void) {
        unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
        c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
        c[1] = 0b11000000; // to register rax (000) which holds the return value
                           // according to linux x86_64 calling convention
        c[6] = 0b11000011; // return
        for (c[2] = 0; c
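
The excerpt above is cut off, but for context: on x86 the hardware keeps the instruction cache coherent with stores from the same core, so this kind of code usually works after jumping to the modified bytes. A hedged, portable sketch of the step that follows the byte writes, assuming GCC or Clang; buf and len stand for the mmap'd region above:

    // after writing the machine-code bytes into buf:
    __builtin___clear_cache((char *)buf, (char *)buf + len); // no-op on x86, required on e.g. ARM
    int (*fn)(void) = (int (*)(void))buf;                    // call the freshly written code
    int result = fn();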

Cache size estimation on your system?

Submitted by ↘锁芯ラ on 2019-12-17 07:38:45
Question: I got this program from this link (https://gist.github.com/jiewmeng/3787223). I have been searching the web to gain a better understanding of processor caches (L1 and L2). I want to be able to write a program that lets me estimate the size of the L1 and L2 caches on my new laptop (just for learning purposes; I know I could check the spec).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define KB 1024
    #define MB 1024 * 1024

    int main() {
        unsigned int steps = 256 *
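
The gist's program is cut off above; a minimal self-contained sketch of the usual approach (my own, not the gist) is to walk buffers of increasing size with a cache-line stride and watch the time per access; the points where it jumps roughly mark the L1 and L2 capacities:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t stride = 64;                          // assume 64-byte cache lines
        volatile char *buf = calloc(64 * 1024 * 1024, 1);
        for (size_t size = 4 * 1024; size <= 32 * 1024 * 1024; size *= 2) {
            clock_t t0 = clock();
            size_t touches = 0;
            for (int rep = 0; rep < 100; rep++)
                for (size_t i = 0; i < size; i += stride, touches++)
                    buf[i]++;                              // touch one byte per cache line
            printf("%8zu KB: %.2f ns per access\n", size / 1024,
                   1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / touches);
        }
        free((void *)buf);
        return 0;
    }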

How does the VIPT to PIPT conversion work on L1->L2 eviction

Submitted by 给你一囗甜甜゛ on 2019-12-14 03:55:02
Question: This scenario came into my head and it seems a bit basic, but I'll ask. So there is a virtual index and a physical tag in L1, but the set becomes full, so a line is evicted. How does the L1 controller get the full physical address from the virtual index and the physical tag in L1, so that the line can be inserted into L2? I suppose it could search the TLB for the combination, but that seems slow, and the entry may not be in the TLB at all. Perhaps the full physical address from the original TLB translation is
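
A short worked example of why this usually needs no extra translation, assuming a common L1 geometry of 32 KB, 8-way, 64-byte lines: that gives 32768 / (8 x 64) = 64 sets, so 6 index bits plus 6 line-offset bits = 12 bits, exactly a 4 KB page offset. Those 12 bits are identical in the virtual and the physical address, so the physical address of the victim line is just the stored physical tag concatenated with the set index (and a zero line offset), and no TLB lookup is needed on eviction.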

Probable instruction Cache Synchronization issue in self modifying code?

Submitted by 余生颓废 on 2019-12-14 01:30:14
Question: A lot of related questions (e.g. "How is x86 instruction cache synchronized?") mention that x86 should properly handle i-cache synchronization in self-modifying code. I wrote the following piece of code, which toggles a function call on and off from different threads, interleaved with its execution. I am using a compare-and-swap operation as an additional guard so that the modification is atomic. But I am getting intermittent crashes (SIGSEGV, SIGILL), and analyzing the core dump makes me suspicious if the
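
The asker's code is not shown in this excerpt, but the pattern being described can be sketched roughly as follows, assuming C11 atomics and that both encodings of the patched site fit in one naturally aligned 8-byte slot (all names here are illustrative, not the asker's):

    #include <stdint.h>
    #include <stdatomic.h>

    // code_slot points into executable, writable memory holding either a 5-byte
    // call rel32 padded with NOPs, or 5 NOPs. A single CAS flips between the two
    // encodings so no thread can observe a half-written instruction.
    static int toggle_call(_Atomic uint64_t *code_slot,
                           uint64_t call_bytes, uint64_t nop_bytes, int enable) {
        uint64_t expected = enable ? nop_bytes : call_bytes;
        uint64_t desired  = enable ? call_bytes : nop_bytes;
        return atomic_compare_exchange_strong(code_slot, &expected, desired);
    }

Even when the write itself is atomic, Intel's manual asks the executing cores to run a serializing operation before executing cross-modified code, which is a common source of exactly the kind of intermittent SIGILL/SIGSEGV crashes described here.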

When does the CPU flush a value in the store buffer to the L1 cache?

Submitted by 爱⌒轻易说出口 on 2019-12-13 03:41:14
Question: Core A writes value x to its store buffer, waits for invalidation acks, and then flushes x to the cache. Does it wait for only one ack, or for all acks? And how does it know how many acks to expect from all the CPUs? Answer 1: It isn't clear to me what you mean by "invalidation ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line. In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since the stores in the
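
Although the question is about hardware internals, the effect of the store buffer is easy to observe from software with the classic store-buffer litmus test. A hedged sketch using C11 threads and relaxed atomics (which compile to plain loads and stores on x86; <threads.h> is an optional part of C11, so this assumes a platform that provides it):

    #include <stdio.h>
    #include <stdatomic.h>
    #include <threads.h>

    atomic_int x, y;
    int r1, r2;

    int t1(void *arg) { (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }
    int t2(void *arg) { (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            atomic_store(&x, 0); atomic_store(&y, 0);
            thrd_t a, b;
            thrd_create(&a, t1, NULL); thrd_create(&b, t2, NULL);
            thrd_join(a, NULL); thrd_join(b, NULL);
            if (r1 == 0 && r2 == 0)   // both stores were still sitting in their cores' store buffers
                printf("store-buffer reordering seen at iteration %d\n", i);
        }
        return 0;
    }

Seeing r1 == r2 == 0 is only possible because each core's store can still be sitting in its store buffer, not yet visible to the other core, when the other core's load executes.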

How does cache associativity impact performance [duplicate]

Submitted by 十年热恋 on 2019-12-13 02:52:24
Question: This question already has answers here: "Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?" (3 answers). Closed 4 months ago. I am reading "Pro .NET Benchmarking" by Andrey Akinshin, and one thing puzzles me (p. 536): the explanation of how cache associativity impacts performance. In a test the author used three square arrays of ints, 1023x1023, 1024x1024 and 1025x1025, and observed that accessing the first column was slower in the 1024x1024 case. The author explained (background info,
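
A rough worked example of the numbers involved, assuming a typical 32 KB, 8-way L1 with 64-byte lines (so addresses 4096 bytes apart share a set): in a 1024x1024 int array, consecutive elements of a column are 1024 x 4 = 4096 bytes apart, so the whole column maps to the same set and at most 8 of its lines can be cached at once; with 1023 or 1025 columns the spacing is 4092 or 4100 bytes and the column spreads across many different sets, which is why only the 1024x1024 case is slow.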

Slowdown when accessing data at page boundaries?

Submitted by 百般思念 on 2019-12-13 00:47:46
Question: (My question is about computer architecture and performance. I did not find a more relevant forum, so I am posting it here as a general question.) I have a C program which accesses memory words that are located X bytes apart in virtual address space, for instance for (int i = 0; <some stop condition>; i += X) { array[i] = 4; }. I measure the execution time for varying values of X. Interestingly, when X is a power of 2 around the page size, e.g. X = 1024, 2048, 4096, 8192..., I get huge
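
A minimal self-contained sketch of the measurement being described, normalising by the number of stores so that different strides are comparable; the 256 MB buffer size is an assumption for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t len = 256u * 1024 * 1024;        // 256 MB working set
        char *array = calloc(len, 1);
        for (size_t x = 64; x <= 16384; x *= 2) {     // stride X in bytes
            size_t stores = 0;
            clock_t t0 = clock();
            for (size_t i = 0; i < len; i += x, stores++)
                array[i] = 4;
            printf("X = %6zu: %.2f ns per store\n", x,
                   1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / stores);
        }
        free(array);
        return 0;
    }

Power-of-two strides near the page size make every access land in the same handful of cache sets (and, at 4096 and above, touch a new page on every store, stressing the TLB as well), which is the usual explanation for the cliff the question describes.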