How to find the L1 cache line size with IO timing measurements?


Question


As a school assignment, I need to find a way to get the L1 data cache line size, without reading config files or using API calls. I'm supposed to use memory read/write access timings to analyze and extract this information. So how might I do that?

In an incomplete try for another part of the assignment, to find the levels & size of cache, I have:

for (i = 0; i < steps; i++) {
    arr[(i * 4) & lengthMod]++;
}

I was thinking maybe I just need to vary line 2, the (i * 4) part? Once I exceed the cache line size, I might need to replace it, which takes some time? But is it that straightforward? The required block might already be in memory somewhere? Or perhaps I can still count on the fact that if steps is large enough, it will still work out quite accurately?

UPDATE

Here's an attempt on GitHub ... the main part is below

// repeatedly access/modify data, varying the STRIDE
for (int s = 4; s <= MAX_STRIDE/sizeof(int); s*=2) {
    start = wall_clock_time();
    for (unsigned int k = 0; k < REPS; k++) {
        data[(k * s) & lengthMod]++;
    }
    end = wall_clock_time();
    timeTaken = ((float)(end - start))/1000000000;
    printf("%d, %1.2f \n", s * sizeof(int), timeTaken);
}

The problem is there doesn't seem to be much difference between the timings. FYI, since it's for the L1 cache, I have SIZE = 32 K (the size of the array).


Answer 1:


Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.

Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.

Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.

When n == your cache line size, each access will require reading a new line into the L1 cache. So the benchmark results should get slower quite sharply at that point.

If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.
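
A minimal sketch of this approach, assuming a 64 MB array and clock_gettime for timing (those specifics are mine, not part of the answer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE (64 * 1024 * 1024)   /* far too big for L1 or L2 */

int main(void) {
    unsigned char *arr = malloc(ARRAY_SIZE);
    if (!arr) return 1;
    srand(42);
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        arr[i] = (unsigned char)rand();            /* fill with random data */

    for (size_t n = 1; n <= 1024; n *= 2) {        /* step size in bytes */
        struct timespec t0, t1;
        unsigned long sum = 0;
        size_t accesses = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < ARRAY_SIZE; i += n) {   /* walk in steps of n bytes */
            sum += arr[i];
            accesses++;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* printing sum keeps the compiler from optimizing the loop away;
           the per-access cost stops rising once n reaches the line size */
        printf("n=%4zu: %8.1f M accesses/s (sum=%lu)\n",
               n, accesses / secs / 1e6, sum);
    }
    free(arr);
    return 0;
}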




Answer 2:


Have a look at Calibrator. All of the work is copyrighted, but the source code is freely available. The idea its documentation describes for calculating cache line sizes sounds much more well-founded than what has already been said here.

The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent memory accesses) and the size of the memory area, we force varying cache miss rates.

In principle, the occurrence of cache misses is determined by the array size. Array sizes that fit into the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously, arrays that exceed the L1 cache size but still fit into L2, will cause L1 misses but no L2 misses. Finally, arrays larger than L2 cause both L1 and L2 misses.

The frequency of cache misses depends on the access stride and the cache line size. With strides equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is the ratio cache line size/stride.

Thus, we can calculate the latency for a cache miss by comparing the execution time without misses to the execution time with exactly one miss per iteration. This approach only works, if memory accesses are executed purely sequential, i.e., we have to ensure that neither two or more load instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing mechanism to achieve this: the memory area we access is initialized such that each load returns the address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from their ability to hide memory access latency by speculative execution.

To measure the cache characteristics, we run our experiment several times, varying the stride and the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal expected cache line size, and that the array size varies from half the minimal expected cache size to at least ten times the maximal expected cache size.
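
As an illustration of the pointer-chasing setup described above, here is a minimal sketch (the buffer size, stride, and helper name build_chain are mine, not Calibrator's actual code):

#include <stdio.h>
#include <stdlib.h>

/* Lay out a buffer so that each slot holds the address of the next slot,
   'stride' bytes away, wrapping at the end. Every load then depends on the
   result of the previous one, so the CPU cannot overlap the accesses. */
static void **build_chain(char *buf, size_t bytes, size_t stride) {
    size_t slots = bytes / stride;
    for (size_t i = 0; i < slots; i++) {
        void **slot = (void **)(buf + i * stride);
        *slot = buf + ((i + 1) % slots) * stride;   /* address of the next slot */
    }
    return (void **)buf;
}

int main(void) {
    size_t bytes = 8 * 1024 * 1024;                 /* larger than L2 */
    char *buf = malloc(bytes);
    if (!buf) return 1;

    void **p = build_chain(buf, bytes, 64);         /* stride of 64 bytes */
    for (long i = 0; i < 10000000; i++)             /* chase the chain */
        p = (void **)*p;

    printf("%p\n", (void *)p);                      /* keep the loop from being optimized out */
    free(buf);
    return 0;
}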

I had to comment out #include "math.h" to get it to compile; after that it found my laptop's cache values correctly. I also couldn't view the generated PostScript files.




Answer 3:


You can use the CPUID instruction in assembler; although it is not portable, it will give you what you want.

For Intel Microprocessors, the Cache Line Size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.

For AMD Microprocessors, the data Cache Line Size is in cl and the instruction Cache Line Size is in dl after calling cpuid function 0x80000005.

I took this from this article here.
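
A minimal sketch of that approach, using GCC/Clang's <cpuid.h> helper instead of hand-written assembler (x86/x86-64 only; the leaves and register fields are the ones described above):

#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Intel: leaf 0x1, bits 15:8 of EBX (i.e. BH) give the line size
       in 8-byte units, hence the multiplication by 8. */
    if (__get_cpuid(0x1, &eax, &ebx, &ecx, &edx))
        printf("line size (Intel convention): %u bytes\n", ((ebx >> 8) & 0xff) * 8);

    /* AMD: leaf 0x80000005, CL holds the L1 data cache line size and
       DL the L1 instruction cache line size, both in bytes. */
    if (__get_cpuid(0x80000005, &eax, &ebx, &ecx, &edx)) {
        printf("L1 data line size (AMD convention): %u bytes\n", ecx & 0xff);
        printf("L1 instruction line size (AMD convention): %u bytes\n", edx & 0xff);
    }
    return 0;
}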




Answer 4:


I think you should write a program that walks through the array in random order instead of straight through, because modern processors do hardware prefetching. For example, make an array of int whose values are the index of the next cell. I did a similar program a year ago: http://pastebin.com/9mFScs9Z Sorry for my English, I am not a native speaker.
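
A minimal sketch of that idea (the sizes and names here are mine, not taken from the pastebin): build an int array in which each cell stores the index of the next cell to visit, in shuffled order, then chase that chain so every access is data-dependent and unpredictable to the prefetcher.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* 4M ints, well beyond L1/L2 */

int main(void) {
    int *next  = malloc(N * sizeof(int));
    int *order = malloc(N * sizeof(int));
    if (!next || !order) return 1;

    /* random permutation of the indices (Fisher-Yates) */
    for (int i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (int i = N - 1; i > 0; i--) {
        /* combine two rand() calls: RAND_MAX may be as small as 32767 */
        unsigned int r = ((unsigned int)rand() << 15) | (unsigned int)rand();
        int j = (int)(r % (unsigned int)(i + 1));
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    /* each cell's value is the index of the next cell in the walk */
    for (int i = 0; i < N; i++)
        next[order[i]] = order[(i + 1) % N];

    int idx = order[0];
    long sum = 0;
    for (long k = 0; k < 10000000; k++) {   /* chase the chain */
        idx = next[idx];
        sum += idx;
    }
    printf("%ld\n", sum);                   /* defeat dead-code elimination */
    free(next);
    free(order);
    return 0;
}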




Answer 5:


See how memtest86 is implemented. It measures and analyzes the data transfer rate in some way; the points where the rate changes correspond to the sizes of the L1, L2, and possibly L3 caches.




Answer 6:


If you get stuck in the mud and can't get out, look here.

There are manuals and code that explain how to do what you're asking. The code is pretty high quality as well. Look at "Subroutine library".

The code and manuals are based on x86 processors.




Answer 7:


I think it should be enough to time an operation that uses some amount of memory, then progressively increase the memory (operands, for instance) used by the operation. When the operation's performance drops severely, you have found the limit.

I would go with just reading a bunch of bytes without printing them (printing would hurt performance so badly that it would become the bottleneck). While reading, the timing should be directly proportional to the amount of bytes read until the data no longer fits in L1; then you will see the performance hit.

You should also allocate the memory once at the start of the program and before starting to count time.
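
A minimal sketch along those lines (the sizes, repetition count, and use of clock_gettime are assumptions, not from the answer): allocate the largest buffer once up front, then time the same number of reads over progressively larger working sets and watch where the throughput drops.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t max_bytes = 16 * 1024 * 1024;       /* largest working set tested */
    unsigned char *buf = malloc(max_bytes);          /* allocate once, before any timing */
    if (!buf) return 1;
    for (size_t i = 0; i < max_bytes; i++) buf[i] = (unsigned char)i;

    for (size_t ws = 4 * 1024; ws <= max_bytes; ws *= 2) {
        const long reads = 256L * 1024 * 1024;       /* same number of reads per test */
        unsigned long sum = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long k = 0; k < reads; k++)
            sum += buf[(size_t)k & (ws - 1)];        /* ws is a power of two */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%6zu KiB: %6.0f MB/s (sum=%lu)\n", ws / 1024, reads / secs / 1e6, sum);
    }
    free(buf);
    return 0;
}

Note that this sequential access pattern lets the hardware prefetcher soften the step at each cache boundary, which is the point made in the random-order answer above.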




Answer 8:


Just a note.

The cache line size is variable on a few ARM Cortex families and can change during execution without any notification to the running program.



Source: https://stackoverflow.com/questions/12675092/how-to-find-the-size-of-the-l1-cache-line-size-with-io-timing-measurements
