Multithreading - How to use CPU as much as possible?

Posted by 霸气de小男生 on 2021-02-07 11:08:46

Question


I'm currently implementing a TensorFlow custom op (a custom data fetcher) in C++ in order to speed up my TensorFlow model. Since the model doesn't use the GPU much, I believe I can get maximal performance by running multiple worker threads concurrently.

The problem is that, even with enough workers, my program doesn't utilize all of the CPU. On my development machine (4 physical cores), it uses about 90% user time and 4% sys time with 4 worker threads and the tf.ConfigProto(inter_op_parallelism_threads=6) option.

With more worker threads and a higher inter_op_parallelism_threads setting, the model runs much slower than with the previous configuration. Since I'm not good at profiling, I don't know where the bottleneck in my code is.

Are there any rules of thumb for maximizing CPU usage, and/or good tools for finding performance bottlenecks or mutex lock contention in a single process (not system-wide) on Linux?
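As a rough first check for a single process, CPU time can be compared with wall-clock time using only Python's standard library. The sketch below is an illustration under stated assumptions, not a profiler: `measure_cpu` and `busy_work` are hypothetical names, and `busy_work` only stands in for the real workload. It reports user and sys time as a percentage of the capacity of a given number of cores, the same kind of figure as the 90% / 4% quoted above.

```python
import os
import time


def measure_cpu(fn, num_cores):
    """Run fn() and report this process's CPU use as a share of num_cores.

    os.times() accounts for every thread in the process, so time spent in
    native worker threads is included as well.
    """
    start = os.times()
    wall_start = time.perf_counter()
    fn()
    wall = time.perf_counter() - wall_start
    end = os.times()
    capacity = wall * num_cores  # CPU-seconds available on num_cores cores
    return {
        "user_pct": 100.0 * (end.user - start.user) / capacity,
        "sys_pct": 100.0 * (end.system - start.system) / capacity,
        "wall_s": wall,
    }


def busy_work():
    # Hypothetical stand-in for the real workload (e.g. a batch of sess.run() calls).
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    print(measure_cpu(busy_work, num_cores=os.cpu_count() or 1))
```

A user percentage well below 100 while many threads are runnable suggests that threads are waiting (on locks, I/O, or each other) rather than computing, which is where a sampling profiler would be the next step.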

EDIT: My code is driven from Python, but almost all of the execution happens in C++ code. Some of it is not mine (TensorFlow and Eigen). I've built a shared library that is loaded dynamically from Python and is called by a TensorFlow kernel. TensorFlow has its own thread pool, my library also has its own thread pool, and my code is thread-safe. I also create threads that call sess.run() concurrently. Just as Python can issue multiple HTTP requests concurrently, sess.run() releases the GIL. My goal is to call sess.run() as often as possible to increase "real" performance, and none of the Python profilers I tried was successful.
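The threading setup described in the edit can be sketched roughly as follows. This is not the actual code: `run_step` is a hypothetical stand-in for sess.run(), and it uses time.sleep only because that call also releases the GIL while it waits, so several Python threads can be inside it at once.

```python
import threading
import time


def run_step(step_id):
    # Hypothetical stand-in for sess.run(): time.sleep releases the GIL,
    # much like a call that spends its time in native code.
    time.sleep(0.05)
    return step_id


def run_concurrently(num_threads, steps_per_thread):
    """Call run_step from several Python threads and collect the results."""
    results = []
    lock = threading.Lock()

    def worker(thread_id):
        for i in range(steps_per_thread):
            out = run_step((thread_id, i))
            with lock:  # protect the shared results list
                results.append(out)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_concurrently(num_threads=4, steps_per_thread=2)
    print(len(results), "steps in", round(elapsed, 3), "s")
```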


Answer 1:


1) More threads do not mean more speed. If you have 4 cores, you cannot go any faster than 4 times the speed of 1 core.
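To put numbers on that ceiling, here is a small sketch of the arithmetic. It uses Amdahl's law, which the answer does not mention by name, and `speedup_bound` and `parallel_fraction` are hypothetical names chosen for illustration. Threads beyond the core count are assumed to add nothing, and any serial part of the work lowers the bound further.

```python
def speedup_bound(num_threads, num_cores, parallel_fraction=1.0):
    """Upper bound on speedup over one thread on one core.

    parallel_fraction is the share of the work that can run in parallel
    (1.0 means perfectly parallel). Threads beyond num_cores do not help.
    """
    effective = min(num_threads, num_cores)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / effective)


if __name__ == "__main__":
    # Perfectly parallel work: 8 threads on 4 cores is still capped at 4x.
    print(speedup_bound(num_threads=8, num_cores=4))
    # If 10% of the work is serial, the bound drops below 4x.
    print(speedup_bound(num_threads=4, num_cores=4, parallel_fraction=0.9))
```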

2) What you should do is tune your code for maximum performance in single-thread execution (with compiler optimization turned off), and after you have done that, turn on the compiler's optimizer and make the code multi-threaded, with no more threads than you have cores.
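The "no more threads than cores" rule might look like this in Python. It is a sketch under assumptions: `capped_worker_count` and `square` are hypothetical names, and os.cpu_count() reports logical CPUs, which can exceed the number of physical cores on machines with hyper-threading.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_worker_count(requested, available_cores=None):
    """Return a worker count that never exceeds the number of cores."""
    if available_cores is None:
        # Logical CPU count; may be larger than the physical core count.
        available_cores = os.cpu_count() or 1
    return max(1, min(requested, available_cores))


def square(x):
    # Hypothetical stand-in for one unit of real work.
    return x * x


if __name__ == "__main__":
    workers = capped_worker_count(requested=16)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        print(workers, list(pool.map(square, range(8))))
```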

P.S. It is a common misconception that performance tuning can only be done on compiler-optimized code. This explains why it's not so.



Source: https://stackoverflow.com/questions/40218075/multithreading-how-to-use-cpu-as-much-as-possible
