Forcing GCC to perform loop unswitching of memcpy runtime size checks?

Submitted by 只愿长相守 on 2019-12-04 23:44:06

Ultimately, the issue at hand is one of asking the optimizer to make assumptions about runtime behavior based on multiple variables. While it is possible to provide the optimizer some compile-time hints via the use of 'const' and 'register' declarations on the key variables, ultimately, you're depending on the optimizer to make a lot of assumptions. Further, while the memcpy() may well be intrinsic, it's not guaranteed to be and even if/when it is, the implementation(s) could vary fairly widely.

If the goal is to achieve maximum performance, sometimes you just have to not rely on technology to figure it out for you, but rather do it directly. The best advice for this situation is to use inline assembler to address the problem. Doing so allows you to avoid all of the pitfalls of a "black box" solution courtesy of the heuristics of the compiler and optimizer, and to definitively state your intent. The key benefit of inline assembler is the ability to avoid any pushes/pops and extraneous "generalization" code in the solution to the memory copy problem, and to take direct advantage of the processor's ability to solve the problem. The downside is maintenance, but given that you really only need to address Intel and AMD to cover most of the market, it's not insurmountable.

I might add, too, that this solution could well allow you to take advantage of multiple cores/threads and/or a GPU if/when available to do the copying in parallel and truly get a performance gain. While the latency might be higher, the throughput would very likely be much higher, as well. If, for example, you could take advantage of a GPU when present, you could well launch one kernel per copy and copy thousands of elements in a single operation.

The alternative to this is to depend on the compiler/optimizer to make the best guesses for you, use the 'const' and 'register' declarations where you can to offer the compiler hints and use magic numbers to branch based on "best solution" paths... this, however, is going to be exceptionally compiler/system dependent and your mileage will vary widely from one platform/environment to another.

bazza

SSE/AVX and Alignment

If you're on, for example, a modern-ish Intel processor, then using SSE or AVX instructions is an option. Whilst not specifically about GCC, see this. If you're interested and flush with cash, I think Intel do a version of their compiler suite for Linux as well as Windows, and I guess that comes with its own suite of libraries.

There's also this post.

Threads (eek)

I've had exactly this sort of problem fairly recently, a memcpy() taking too much time. In my instance it was one big memcpy() (1MByte or so) rather than a lot of smaller ones like you're doing.

I got very good mileage by writing my own multi-threaded memcpy(), where the threads were persistent and got 'tasked' with a share of the job by a call to my own pmemcpy() function. The persistent threads meant that the overhead was pretty low. I got a 4x improvement for 4 cores.

So if it were possible to break your loops down into a sensible number of threads (I went for one per available core), and you had the luxury of a few spare cores on your machine you might get a similar benefit.

What the real time crowd do - DMA

Just as an aside, I have the pleasure of playing around with some fairly exotic OpenVPX hardware. Basically it's a bunch of boards in a big box with a high speed serial RapidIO interconnect between them. Each board has a DMA engine that drives data across the sRIO to another board's memory.

The vendor I went with is pretty clever about how to maximise the use of a CPU. The key is that the DMA engines are pretty smart - they can be programmed to do things like matrix transformations on the fly, strip mining, things like you're trying to do, etc. And because it's a separate piece of hardware, the CPU isn't tied up in the meantime, so it can be busy doing something else.

For example, if you're doing something like Synthetic Aperture Radar processing you always end up doing a big matrix transform. The beauty is that the transform itself takes no CPU time at all - you just move the data to another board and it arrives already transformed.

Anyway, having the benefit of that sort of thing really makes one wish that Intel CPUs (and others) had onboard DMA engines capable of working memory-to-memory instead of just memory-to-peripheral. That would make tasks like yours really quick.

I think the best way is to experiment and find the optimal "k" value at which to switch between the original algorithm (with a loop) and your optimized algorithm using memcpy. The optimal "k" will vary across different CPUs, but shouldn't be drastically different; essentially it comes down to the overhead of calling memcpy, plus the overhead within memcpy itself of choosing the optimal algorithm (based on size, alignment, etc.), versus the "naive" algorithm with a loop.

memcpy is an intrinsic in GCC, yes, but it doesn't do magic. What it basically means is that if the size argument is known at compile time and small-ish (I don't know what the threshold is), then GCC will replace the call to the memcpy function with inline code. If the size argument is not known at compile time, a call to the library function memcpy will always be made.
