I am interested in flushing cache (L1, L2, and L3) only for a region of address space, for example all cache entries from address A to address B. Is there a mechanism to do
Several people have expressed misgivings about clear_cache. Below is a manual process for evicting the cache. It is inefficient, but it can be done from any user-space task (on any OS).
It is possible to evict caches by misusing the pld instruction. A pld fetches a cache line. To evict a specific memory address, you need to know the structure of your caches. For instance, a Cortex-A9 has a 4-way data cache with 8 words (32 bytes) per line. The cache size is configurable as 16KB, 32KB, or 64KB, which gives 512, 1024, or 2048 lines. The lower address bits select the set, not the way, so sequential addresses never conflict with each other. You fill a new way of the same set by accessing memory at offset + (cache size / ways). On a Cortex-A9 that is a stride of 4KB, 8KB, or 16KB.
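The arithmetic above can be sketched in C. The struct and function names here are hypothetical helpers, not part of any ARM or OS API, and the geometry values are the Cortex-A9 figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one level of a set-associative cache. */
struct cache_geom {
    size_t size_bytes; /* total size, e.g. 16 * 1024 */
    size_t ways;       /* associativity, 4 for the Cortex-A9 L1 data cache */
    size_t line_bytes; /* 8 words * 4 bytes = 32 on a Cortex-A9 */
};

/* Total number of cache lines in the cache. */
static size_t cache_lines(struct cache_geom g)
{
    assert(g.line_bytes != 0);
    return g.size_bytes / g.line_bytes;
}

/* Distance in bytes between two addresses that map to the same set. */
static size_t way_stride(struct cache_geom g)
{
    assert(g.ways != 0);
    return g.size_bytes / g.ways;
}
```

For a 16KB cache, `cache_lines` gives 512 and `way_stride` gives 4096 (4KB), matching the numbers in the text.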
Using ldr in 'C' or 'C++' is simple. You just need to size an array appropriately and access it.
See: Programmatically get the cache line size?
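A minimal sketch of the array approach, assuming you already know the line size (32 bytes on a Cortex-A9) and have a buffer at least as large as the cache. The function name `evict_by_reading` is made up for illustration. The `volatile` qualifier is there so the compiler actually emits the loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Touch one byte in every cache line of buf so each line is pulled into
 * the cache, displacing whatever was there before. The sum is returned
 * only to give the loads an observable result.
 */
static unsigned evict_by_reading(const volatile uint8_t *buf,
                                 size_t len, size_t line_bytes)
{
    unsigned sum = 0;

    assert(line_bytes != 0);
    for (size_t off = 0; off < len; off += line_bytes)
        sum += buf[off]; /* one plain load (ldr/ldrb) per cache line */
    return sum;
}
```

Calling this on a 16KB buffer with `line_bytes` of 32 performs 512 loads, one per line. Whether that is enough to evict everything depends on the replacement policy discussed below.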
For example, if you want to evict 0x12345, the line starts at 0x12340, and for a 16KB round-robin cache a pld on 0x13340, 0x14340, 0x15340, and 0x16340 would evict the line from whichever way holds it. The same principle can be applied to evict L2 (which is often unified). Iterating over a region the size of the cache will evict the entire cache, so you need to allocate a block of unused memory the size of the cache. This might be quite large for the L2. pld does not have to be used; a full memory access (ldr/ldm) works as well. With multiple CPUs (threaded cache eviction) you need to run the eviction on each CPU. Usually the L2 is global to all CPUs, so for L2 it only needs to be run once.
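The address example can be reproduced with a few lines of C. This is only a sketch of the address calculation; `line_base` and `conflict_addrs` are hypothetical names, and the line size is assumed to be a power of two. Actually touching the resulting addresses (with pld or ldr) requires that they are mapped in your process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to the start of its cache line (line_bytes must be a power of two). */
static uintptr_t line_base(uintptr_t addr, size_t line_bytes)
{
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
    return addr & ~(uintptr_t)(line_bytes - 1);
}

/*
 * Fill out[0..ways-1] with addresses that map to the same set as addr,
 * each one way-stride further along. Accessing all of them fills every
 * way of that set on an LRU or round-robin cache.
 */
static void conflict_addrs(uintptr_t addr, size_t line_bytes,
                           size_t stride, size_t ways, uintptr_t *out)
{
    uintptr_t base = line_base(addr, line_bytes);

    for (size_t i = 0; i < ways; i++)
        out[i] = base + (uintptr_t)(i + 1) * stride;
}
```

With `addr` 0x12345, 32-byte lines, a 0x1000 stride (16KB / 4 ways), and 4 ways, this yields 0x13340, 0x14340, 0x15340, and 0x16340.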
NB: This method only works with LRU (least recently used) or round-robin caches. For pseudo-random replacement, you will have to write/read more data to ensure eviction, and the exact amount is highly CPU specific. The ARM random replacement is based on an LFSR that is from 8 to 33 bits wide depending on the CPU. Some CPUs default to round-robin and others default to the pseudo-random mode. For a few CPUs a Linux kernel configuration option selects the mode (ref: CPU_CACHE_ROUND_ROBIN). For newer CPUs, however, Linux uses the default from the boot loader and/or silicon. In other words, if you need to be completely generic, it is worth the effort to get the clear_cache OS calls to work (see other answers); otherwise you will have to spend a lot of time to clear the caches reliably.
It is possible to circumvent the cache by fooling the OS using the MMU on some ARM CPUs and particular OSes. On a *nix system, you need multiple processes. You switch between the processes and the OS should flush the caches. Typically this will only work on older ARM CPUs (ones not supporting pld), where the OS has to flush the caches to ensure no information leaks between processes. It is not portable and requires that you understand a lot about your OS.
Most explicit cache flushing registers are restricted to system mode to prevent denial-of-service attacks between processes. Some exploits try to gain information by observing which lines have been evicted by another process (this can reveal which addresses that process is accessing). These attacks are more difficult with pseudo-random replacement.