So I am trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes and I feel I understand conceptually how to do it, but I am running into problems.
Ok, several issues with your code:
As you mentioned, your measurements are taking a long time. In fact, they very likely take far longer than the single access itself, so they're not measuring anything useful. To mitigate that, access multiple elements and amortize (divide the overall time by the number of accesses). Note that to measure latency, you want these accesses to be serialized, otherwise they can be performed in parallel and you'll only measure the throughput of unrelated accesses. To achieve that, you could just add a false dependency between accesses.
For example, initialize the array to zeros, and do:
struct timespec startAccess, endAccess;
int index = 0;
clock_gettime(CLOCK_REALTIME, &startAccess); // start clock
for (int i = 0; i < NUM_ACCESSES; ++i) {
    int tmp = arrayAccess[index];     // the load; the array is zeroed, so tmp == 0
    index = (index + i + tmp) & 1023; // next index depends on the loaded value, serializing the accesses
}
clock_gettime(CLOCK_REALTIME, &endAccess); // end clock
.. and of course remember to divide the time by NUM_ACCESSES.
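That division could look like this (a minimal sketch, using the struct timespec values from the snippet above):

double elapsed_ns = (endAccess.tv_sec - startAccess.tv_sec) * 1e9
                  + (endAccess.tv_nsec - startAccess.tv_nsec);
printf("avg access: %f ns\n", elapsed_ns / NUM_ACCESSES);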
Now, I've made the index intentionally complicated so that you avoid a fixed stride, which might trigger a prefetcher (a bit of overkill, since you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you strides of 128 bytes (two cache lines) and avoid the "benefit" of most simple adjacent-line / simple-stream prefetchers. I've also replaced the % 1000 with & 1023 since & is faster, but the size needs to be a power of 2 for it to work the same way - so just increase ACCESS_SIZE to 1024 and it should work.
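If you do settle for the simple stride, you'd still want to keep the false dependency so the loads stay serialized. A sketch, reusing the names from the loop above:

int tmp = arrayAccess[index];      // tmp is always 0 for a zeroed array
index = (index + tmp + 32) & 1023; // fixed 32-element stride, still serialized through tmp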
Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system, but 256000 seems pretty big for an L1; an L2 is usually 256k on many common modern x86 CPUs, for example. Also note that 256k is not 256000, but rather 256*1024 = 262144. The same goes for the second size: 1M is not 1024000, it's 1024*1024 = 1048576 - assuming that's indeed your L2 size (it's more likely an L3 size, but probably too small for that).
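Corrected constants might look like this (a sketch - the actual sizes here are assumptions, check your own CPU, e.g. via lscpu or /sys/devices/system/cpu/cpu0/cache/ on Linux):

#define L1_CACHE_SIZE (32 * 1024)        /* 32 KiB - an assumption, adjust to your CPU */
#define L2_CACHE_SIZE (256 * 1024)       /* 256 KiB = 262144, not 256000 */
#define L3_CACHE_SIZE (8 * 1024 * 1024)  /* 8 MiB - an assumption */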
Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 bytes, depending on the system). You're actually invalidating L1_CACHE_SIZE*sizeof(int) worth of bytes (and the same goes for the L2 invalidation loop).
memset receives the size in bytes, but your sizes are divided by sizeof(int).
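A minimal sketch of the fix, assuming a hypothetical eviction buffer named evictL2 (the name is mine, not from your code):

/* needs <stdlib.h> and <string.h> */
int *evictL2 = malloc(L2_CACHE_SIZE); /* size in bytes */
memset(evictL2, 0, L2_CACHE_SIZE);    /* bytes too - not L2_CACHE_SIZE / sizeof(int) */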
Your invalidation reads are never used and may be optimized out. Try to accumulate the reads in some value and print it at the end, to avoid this possibility.
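For example, something along these lines (a sketch; evictL2 is the hypothetical buffer from above):

long long sink = 0;
for (int i = 0; i < (int)(L2_CACHE_SIZE / sizeof(int)); ++i)
    sink += evictL2[i];         /* the reads now produce a live value */
printf("sink = %lld\n", sink);  /* printing keeps the compiler from discarding the loop */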
The memset at the beginning is accessing the data as well, therefore your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from the L1+L2, although only partially, due to the size error).
The strides may be too small, so you can get two accesses to the same cache line (an L1 hit). Make sure they're spread enough by adding 32 elements (x4 bytes) - that's two cache lines, so you also won't get any adjacent-cache-line prefetch benefits.
Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the average time shifts in favor of the L1 access latency). Instead, try using the L1 size so you access the entire L1 (except for the skips) exactly once. For example, like this:
int index = 0;
int count = 0; // divide the overall time by this
while (index < L1_CACHE_SIZE) { // assuming L1_CACHE_SIZE counts elements here
    int tmp = arrayAccess[index]; // access value from L2
    index = index + tmp + ((index & 4) ? 28 : 36); // on average this gives 32-element skips, with changing strides
    count++;
}
Don't forget to increase arrayAccess to the L1 size.
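That is, something like this (a sketch; whether L1_CACHE_SIZE counts bytes or int elements depends on your definitions - here it counts elements, matching the loop above):

int arrayAccess[L1_CACHE_SIZE];              // one int per index the loop can reach
memset(arrayAccess, 0, sizeof(arrayAccess)); // zeroed, so tmp == 0 and the dependency stays false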
Now, with the changes above (more or less), I get something like this:
L1 Cache Access 7.812500
L2 Cache Access 15.625000
L3 Cache Access 23.437500
This still seems a bit long, but that's possibly because it includes an additional dependency on arithmetic operations.