Is there any way to read/write memory without touching the L1/L2/L3 caches on x86 CPUs?
And is the cache on x86 CPUs managed entirely by hardware?
Leeor pretty much listed the most "pro" solutions for your task. I'll try to add another proposal that can achieve the same results and can be written in plain C with simple code. The idea is to write a kernel similar to the "Global Random Access" kernel found in the HPC Challenge (HPCC) benchmark.
The idea of the kernel is to jump randomly through a huge array of 8-byte values that is generally half the size of your physical memory (so if you have 16 GB of RAM, you need an 8 GB array, which gives 1G elements of 8 bytes each). On each jump you can read, write, or read-modify-write (RMW) the target location.
This most likely measures RAM latency, because jumping randomly through RAM makes caching very inefficient. You will get extremely low cache hit rates, and if you perform enough operations on the array, you will measure the actual performance of memory. This method also makes prefetching very ineffective, as there is no detectable access pattern.
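To make the idea concrete, here is a minimal sketch of such a kernel in plain C. It is not the HPCC code: the names `xorshift64`, `random_updates` and `time_random_updates`, the xorshift generator, the XOR update, and the power-of-two table size are all my own assumptions for illustration, and `clock_gettime` assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime (POSIX assumption) */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Cheap xorshift64 PRNG so index generation does not dominate the timing.
 * The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Perform n_updates read-modify-write operations at pseudo-random indices.
 * n_elems must be a power of two so that masking yields a valid index. */
static void random_updates(uint64_t *table, size_t n_elems,
                           uint64_t n_updates, uint64_t seed)
{
    assert(n_elems != 0 && (n_elems & (n_elems - 1)) == 0);
    uint64_t state = seed ? seed : 1;
    for (uint64_t i = 0; i < n_updates; i++) {
        uint64_t r = xorshift64(&state);
        table[r & (n_elems - 1)] ^= r;  /* RMW on a random element */
    }
}

/* Allocate the table, run the kernel, and return the average time per
 * update in nanoseconds (-1.0 if allocation fails). The checksum is
 * written out so the compiler cannot discard the updates. */
static double time_random_updates(size_t n_elems, uint64_t n_updates,
                                  uint64_t *checksum)
{
    uint64_t *table = malloc(n_elems * sizeof *table);
    if (table == NULL)
        return -1.0;

    /* Touch every element first so all pages are mapped before timing. */
    for (size_t i = 0; i < n_elems; i++)
        table[i] = i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    random_updates(table, n_elems, n_updates, 0x9E3779B97F4A7C15ULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    uint64_t sum = 0;
    for (size_t i = 0; i < n_elems; i++)
        sum ^= table[i];
    *checksum = sum;
    free(table);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return n_updates ? ns / (double)n_updates : 0.0;
}
```

On a 64-bit machine with 16 GB of RAM you would call something like `time_random_updates((size_t)1 << 30, 1ULL << 28, &sum)` for an 8 GB table. Note that the updates in this sketch are independent of each other, so the CPU can overlap several cache misses; the number you get is closer to random-access throughput than to the latency of a single isolated access.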
You need to take the following things into consideration: