Python / OpenCV application lockup issue

Question

My Python application running on a 64-core Linux box normally runs without a problem. Then after some random length of time (around 0.5 to 1.5 days usually) I suddenly start getting frequent pauses/lockups of over 10 seconds! During these lockups the system CPU time (i.e. time in the kernel) can be over 90% (yes: 90% of all 64 cores, not of just one CPU).

My app is restarted often throughout the day. Restarting the app does not fix the problem. However, rebooting the machine does.

Question 1: What could cause 90% system CPU time for 10 seconds? All of the system CPU time is in my parent Python process, not in the child processes created through Python's multiprocessing or other processes. So that means something of the order of 60+ threads spending 10+ seconds in the kernel. I am not even sure if this is a Python issue or a Linux kernel issue.

Question 2: That a reboot fixes the problem must be a big clue as to the cause. What Linux resources could remain exhausted across restarts of my app, but not across reboots, and so keep this problem stuck on?

What I've tried so far to solve this / figure it out

Below I will mention multiprocessing a lot. That's because the application runs in a cycle and multiprocessing is only used in one part of the cycle. The high CPU almost always happens immediately after all the multiprocessing calls finish. I'm not sure if this is a hint at the cause or a red herring.

My app runs a thread that uses psutil to log out the process and system CPU stats every 0.5 seconds. I have independently confirmed what it's reporting with top.
I've converted my app from Python 2.7 to Python 3.4 because Python 3.2 got a new GIL implementation and 3.4 had multiprocessing rewritten. While this improved things, it did not solve the problem (see my previous SO question, which I'm leaving up because its answer is still useful, if not the total answer).
I have replaced the OS. Originally it was Ubuntu 12 LTS, now it's CentOS 7. No difference.
It turns out multithreading and multiprocessing clash in Python/Linux and are not recommended together. Python 3.4 now has forkserver and spawn multiprocessing contexts. I've tried them, no difference.
I've checked /dev/shm to see if I'm running out of shared memory (which Python 3.4 uses to manage multiprocessing), but found nothing unusual
lsof output listing all resources here
It's difficult to test on other machines because I run a multiprocess Pool of 59 children and I don't have any other 64 core machines just lying around
I can't run it using threads rather than processes because it just can't run fast enough due to the GIL (hence why I switched to multiprocessing in the first place)
I've tried using strace on just one thread that is running slow (it can't run across all threads because it slows the app far too much). Below is what I got which doesn't tell me much.
ltrace does not work because you can't use -p on a thread ID. Even just running it on the main thread (no -f) makes the app so slow that the problem doesn't show up.
The problem is not related to load. It will sometimes run fine at full load, and then later at half load, it'll suddenly get this problem.
Even if I reboot the machine nightly the problem comes back every couple of days.
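
To pick which thread to strace, one option is to read per-thread CPU counters straight from /proc. The following is a minimal sketch, not what the app above actually does (that uses psutil). The helper names parse_task_stat and thread_cpu_seconds are made up for illustration, and it assumes a Linux /proc filesystem with the standard stat field layout (utime is field 14, stime is field 15).

```python
import os


def parse_task_stat(line):
    """Return (comm, utime_ticks, stime_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # tail starts at field 3 (state), so fields 14 and 15 are indexes 11 and 12.
    return comm, int(fields[11]), int(fields[12])


def thread_cpu_seconds(pid="self"):
    """Map thread ID -> (user_seconds, system_seconds) for every thread of pid."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = "/proc/{}/task".format(pid)
    result = {}
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                _, utime, stime = parse_task_stat(f.read())
        except OSError:
            continue  # thread exited between listdir() and open()
        result[int(tid)] = (utime / ticks_per_second, stime / ticks_per_second)
    return result
```

Calling thread_cpu_seconds() twice a short interval apart and subtracting the system_seconds values shows which thread IDs are accumulating kernel time. Those IDs are the ones worth attaching strace to.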

Environment / notes:

Python 3.4.3 compiled from source
CentOS 7 totally up to date. uname -a: Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux (although this kernel update was only applied today)
Machine has 128GB of memory and has plenty free
I use numpy linked to ATLAS. I'm aware that OpenBLAS clashes with Python multiprocessing but ATLAS does not, and that clash is solved by Python 3.4's forkserver and spawn which I've tried.
I use OpenCV which also does a lot of parallel work
I use ctypes to access a C .so library provided by a camera manufacturer
App runs as root (a requirement of a C library I link to)
The Python multiprocessing Pool is created in code guarded by if __name__ == "__main__": and in the main thread

Updated strace results

A few times I've managed to strace a thread that ran at 100% 'system' CPU. But only once have I gotten anything meaningful out of it. See below the call at 10:24:12.446614 that takes 1.4 seconds. Given it's the same ID (0x7f05e4d1072c) you see in most of the other calls, my guess would be that this is Python's GIL synchronisation. Does this guess make sense? If so, then the question is why does the wait take 1.4 seconds? Is someone not releasing the GIL?

10:24:12.375456 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000823>
10:24:12.377076 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002419>
10:24:12.379588 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.001898>
10:24:12.382324 sched_yield()           = 0 <0.000186>
10:24:12.382596 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.004023>
10:24:12.387029 sched_yield()           = 0 <0.000175>
10:24:12.387279 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.054431>
10:24:12.442018 sched_yield()           = 0 <0.000050>
10:24:12.442157 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.003902>
10:24:12.446168 futex(0x7f05e4d1022c, FUTEX_WAKE, 1) = 1 <0.000052>
10:24:12.446316 futex(0x7f05e4d11cac, FUTEX_WAKE, 1) = 1 <0.000056>
10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
10:24:13.886513 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002381>
10:24:13.889079 sched_yield()           = 0 <0.000016>
10:24:13.889135 sched_yield()           = 0 <0.000049>
10:24:13.889244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.032761>
10:24:13.922147 sched_yield()           = 0 <0.000020>
10:24:13.922285 sched_yield()           = 0 <0.000104>
10:24:13.923628 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002320>
10:24:13.926090 sched_yield()           = 0 <0.000018>
10:24:13.926244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000265>
10:24:13.926667 sched_yield()           = 0 <0.000027>
10:24:13.926775 sched_yield()           = 0 <0.000042>
10:24:13.926964 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000117>
10:24:13.927241 futex(0x7f05e4d110ac, FUTEX_WAKE, 1) = 1 <0.000099>
10:24:13.927455 futex(0x7f05e4d11d2c, FUTEX_WAKE, 1) = 1 <0.000186>
10:24:13.931318 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000678>
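
With longer traces it can help to total the wait time per futex address instead of reading the lines by eye. Below is a minimal sketch that assumes the trace has the same layout as the lines above (a wall-clock timestamp first and the call duration in angle brackets last, as produced by strace -tt -T). The function name summarize_futex_waits is invented for this example.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
FUTEX_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"futex\((?P<addr>0x[0-9a-f]+),\s*(?P<op>\w+).*<(?P<secs>\d+\.\d+)>\s*$"
)


def summarize_futex_waits(lines):
    """Return {futex address: (FUTEX_WAIT calls, total seconds, longest wait)}."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        m = FUTEX_LINE.match(line.strip())
        # Skip sched_yield() lines and FUTEX_WAKE calls; only waits are counted.
        if m is None or m.group("op") != "FUTEX_WAIT":
            continue
        secs = float(m.group("secs"))
        entry = stats[m.group("addr")]
        entry[0] += 1
        entry[1] += secs
        entry[2] = max(entry[2], secs)
    return {addr: tuple(values) for addr, values in stats.items()}
```

Passing it an open trace file (for example summarize_futex_waits(open("trace.txt")), where trace.txt is a placeholder name) gives one entry per contended address. A single address dominating the total would support the idea that all the waits are on one lock, though it does not by itself say which lock that is.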

Answer 1:

I've managed to get a thread dump from gdb right at the point where 40+ threads are showing 100% 'system' CPU time.

Here's the backtrace which is the same for every one of those threads:

#0  0x00007fffebe9b407 in cv::ThresholdRunner::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x00007fffecfe44a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2  0x00007fffe967496a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3  0x00007fffe96705a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4  0x00007fffe966fc6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5  0x00007fffe966d65f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6  0x00007fffe966d859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7  0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6

My original question put Python and Linux front and center but the issue appears to lie with TBB and/or OpenCV. Since OpenCV with TBB is so widely used I presume it has to also involve the interplay with my specific environment somehow. Maybe because it's a 64 core machine?

I have recompiled OpenCV with TBB turned off and the problem has not reappeared so far. But my app now runs slower.
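
A possible runtime alternative to recompiling, which I have not tested against this lockup, is to cap OpenCV's internal parallelism from Python. The sketch below relies on cv2.setNumThreads, which the OpenCV documentation describes as running functions sequentially when passed 0. The wrapper name limit_opencv_threads is hypothetical, and whether this avoids the TBB behaviour seen in the backtrace is an assumption.

```python
def limit_opencv_threads(num_threads=0):
    """Ask OpenCV to cap its internal parallelism; return True if applied.

    Assumption: cv2.setNumThreads(0) makes OpenCV run its parallel
    regions sequentially. Whether this avoids the lockup is untested.
    """
    try:
        import cv2
    except ImportError:
        return False  # OpenCV bindings are not installed in this interpreter
    cv2.setNumThreads(num_threads)
    return True
```

It would need to be called once in the parent process before any OpenCV work starts. Like the TBB-free build, it can be expected to cost some speed.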

I have posted this as a bug to OpenCV and will update this answer with anything that comes from that.

Source: https://stackoverflow.com/questions/30248295/python-opencv-application-lockup-issue

Tags

python

multithreading

OpenCV

multiprocessing