Non-linear scaling of .NET operations on multi-core machine


Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx

Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write.
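To make that advice concrete, here is a minimal, hypothetical C# sketch (not code from the question; the class name and constants are invented) contrasting a write pattern that invites false sharing, where threads repeatedly write adjacent array slots on the same cache line, with one that accumulates in a local variable and publishes once per thread:

    using System;
    using System.Threading.Tasks;

    // Hypothetical demo of the two write patterns described above.
    class FalseSharingDemo
    {
        const int ThreadCount = 4;
        const int Iterations = 50_000_000;

        static void Main()
        {
            var totals = new long[ThreadCount];

            // Risky: each thread repeatedly writes its own slot, but adjacent
            // longs share a cache line, so the writes contend with each other.
            Parallel.For(0, ThreadCount, t =>
            {
                for (int i = 0; i < Iterations; i++)
                    totals[t]++;
            });

            // Friendlier: accumulate in a local variable and publish the result
            // with a single write per thread.
            Parallel.For(0, ThreadCount, t =>
            {
                long local = 0;
                for (int i = 0; i < Iterations; i++)
                    local++;
                totals[t] = local;
            });

            Console.WriteLine(string.Join(", ", totals));
        }
    }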

So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.

The entire issue with non-linear performance was the result of a single line inside the Evaluate() method:

var coordMatrix = new long[100];

Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR performs some inter-thread coordination when allocating memory; otherwise, allocations on multiple threads could inadvertently overlap. Changing the array from a method-local variable to a class-level member that is allocated only once (but re-initialized in a method-local loop) eliminated the scalability problem.
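For illustration, here is a minimal sketch of the shape of that change. Apart from Evaluate() and coordMatrix, the names are hypothetical, and it assumes each thread works with its own evaluator instance, so the hoisted array is never shared between threads:

    using System;

    // Before: a fresh array on every call means millions of allocations,
    // plus the allocator coordination and GC pressure that come with them.
    class EvaluatorBefore
    {
        public long Evaluate(/* inputs */)
        {
            var coordMatrix = new long[100];   // allocated on every call
            // ... fill and use coordMatrix ...
            return coordMatrix[0];
        }
    }

    // After: one array per evaluator instance (one instance per thread),
    // re-initialized inside the method instead of re-allocated.
    class EvaluatorAfter
    {
        private readonly long[] coordMatrix = new long[100];

        public long Evaluate(/* inputs */)
        {
            Array.Clear(coordMatrix, 0, coordMatrix.Length);  // reset, no allocation
            // ... fill and use coordMatrix ...
            return coordMatrix[0];
        }
    }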

Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.

Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.

P.S. Kudos to Igor Ostrovsky for his germane link to the MSDN blog post, which helped me identify and diagnose the problem.

Non-linear scaling is to be expected when comparing a parallel algorithm with a sequential one, since there is some inherent overhead in the parallelization. (Ideally, of course, you want to get as close to linear as you can.)

Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:

  • The CPU and OS can't devote all of their time to your application; they need to context-switch every now and again to let other processes get some work done. When you're only using a single core, it is less likely that your process is switched out, because the OS has three other cores to choose from. Note that even though you might think nothing else is running, the OS or some services could still be performing background work.
  • If each of your threads accesses a lot of data, and this data is not shared between threads, you will most likely not be able to keep all of it in the CPU cache. That means many more main-memory accesses are required, which are (relatively) slow.

As far as I can tell, your current explicit approach uses an iterator shared between the threads. That's an okay solution if the processing time varies wildly across the array, but it carries per-element synchronization overhead: retrieving the current element and advancing the internal pointer to the next one must be a single atomic operation, or an element could be skipped or processed twice.
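For reference, a shared-iterator scheme along those lines might look like the following sketch (hypothetical names, not the poster's actual code). Each worker claims the next index with Interlocked.Increment, which is exactly the per-element synchronization cost being described:

    using System;
    using System.Threading;

    // Hypothetical shared-iterator work queue: workers atomically claim indices.
    class SharedIndexWorkQueue
    {
        private readonly long[] data;
        private int next = -1;   // Interlocked.Increment returns the new value

        public SharedIndexWorkQueue(long[] data) => this.data = data;

        public void Worker(Action<long> process)
        {
            while (true)
            {
                int i = Interlocked.Increment(ref next);   // claim one element
                if (i >= data.Length) break;
                process(data[i]);
            }
        }
    }

Compared with static partitioning, this balances uneven workloads automatically, at the price of one atomic operation per element.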

Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of its position. Given that you have 10 million records, that means (with four threads) telling thread 1 to work on elements 0 to 2,499,999, thread 2 to work on elements 2,500,000 to 4,999,999, and so on. You can assign each thread an ID and use it to calculate the actual range.
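A rough sketch of that static partitioning, with hypothetical names and assuming four threads and roughly equal per-element cost:

    using System;
    using System.Threading;

    // Hypothetical demo: each thread gets a contiguous slice of the array,
    // so no per-element synchronization is needed.
    class RangePartitionDemo
    {
        static void Main()
        {
            const int threadCount = 4;
            var data = new long[10_000_000];
            var threads = new Thread[threadCount];

            for (int t = 0; t < threadCount; t++)
            {
                int id = t;   // capture a per-thread copy of the loop variable
                threads[t] = new Thread(() =>
                {
                    int chunk = data.Length / threadCount;
                    int start = id * chunk;
                    int end = (id == threadCount - 1) ? data.Length : start + chunk;
                    for (int i = start; i < end; i++)
                    {
                        // process data[i]
                    }
                });
                threads[t].Start();
            }

            foreach (var thread in threads) thread.Join();
            Console.WriteLine("Done.");
        }
    }

The last thread picks up the remainder when the array length isn't evenly divisible by the thread count.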

Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.

I certainly would not expect a linear relationship, but I would have thought you would have seen a bigger gain than that. I am assuming that the CPU usage is maxed out on all cores. Just a couple of thoughts off the top of my head.

  • Are you using any shared data structures (either explicitly or implicitly) that require synchronization?
  • Have you tried profiling or recording performance counters to determine where the bottleneck is? Can you give any more clues?

Edit: Sorry, I just noticed that you have already addressed both of my points.
