C++ cache aware programming

后端 未结 10 1440
离开以前
离开以前 2020-12-07 10:53

is there a way in C++ to determine the CPU\'s cache size? i have an algorithm that processes a lot of data and i\'d like to break this data down into chunks such that they f

相关标签:
10条回答
  • 2020-12-07 11:18

    Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.

    TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).

    0 讨论(0)
  • 2020-12-07 11:23

    The cache will usually do the right thing. The only real worry for normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.

    0 讨论(0)
  • 2020-12-07 11:25

    IIRC, GCC has a __builtin_prefetch hint.

    http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

    has an excellent section on this. Basically, it suggests:

    __builtin_prefetch (&array[i + LookAhead], rw, locality);
    

    where rw is a 0 (prepare for read) or 1 (prepare for a write) value, and locality uses the number 0-3, where zero is no locality, and 3 is very strong locality.

    Both are optional. LookAhead would be the number of elements to look ahead to. If memory access were 100 cycles, and the unrolled loops are two cycles apart, LookAhead could be set to 50 or 51.

    0 讨论(0)
  • 2020-12-07 11:28

    According to "What every programmer should know about memory", by Ulrich Drepper you can do the following on Linux:

    Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen the what the kernel publishes about the hardware. A program has to find the directory:

    /sys/devices/system/cpu/cpu*/cache
    

    This is listed in Section 6: What Programmers Can Do.

    He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.

    There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a system call on Linux which is supposed to return the L2 cache size although it doesn't seem to be well documented.

    0 讨论(0)
  • 2020-12-07 11:28

    You can see this thread: http://software.intel.com/en-us/forums/topic/296674

    The short answer is in this other thread:

    On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)

    0 讨论(0)
  • 2020-12-07 11:28

    There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?

    Determining the cache-size at compile-time

    For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simplify looking up the size and hard-coding it is an option (could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.

    If you want to avoid that complexity or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It will provide a compile-time estimation for the cache line, but be aware of its limitations. Note that the compiler cannot guess perfectly when it creates the binary, as the size of the cache-line is, in general, architecture dependent.

    Determining the cache-size at runtime

    At runtime, you have the advantage that you know the exact machine, but you will need platform specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, MacOS). In a similar fashion, you can also read the L2 cache size at runtime.

    I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.

    Combining both approaches

    If you have to ship one binary and the machines that it will later run on features a range of different architectures with varying cache sizes, you could create specialized code parts for each cache size, and then dynamically (at application startup) choose the best fitting one.

    0 讨论(0)
提交回复
热议问题