I am using Python 2 subprocess with threading threads to take standard input, process it with binaries A, B, and C
I think you are just being misled by the way cProfile works. For example, here's a simple script that uses two threads:
    #!/usr/bin/python

    import threading
    import time

    def f():
        time.sleep(10)


    def main():
        t = threading.Thread(target=f)
        t.start()
        t.join()
If I test this using cProfile, here's what I get:
    >>> import test
    >>> import cProfile
    >>> cProfile.run('test.main()')
             60 function calls in 10.011 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   10.011   10.011 <string>:1(<module>)
            1    0.000    0.000   10.011   10.011 test.py:10(main)
            1    0.000    0.000    0.000    0.000 threading.py:1008(daemon)
            2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
            2    0.000    0.000    0.000    0.000 threading.py:241(Condition)
            2    0.000    0.000    0.000    0.000 threading.py:259(__init__)
            2    0.000    0.000    0.000    0.000 threading.py:293(_release_save)
            2    0.000    0.000    0.000    0.000 threading.py:296(_acquire_restore)
            2    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
            2    0.000    0.000   10.011    5.005 threading.py:308(wait)
            1    0.000    0.000    0.000    0.000 threading.py:541(Event)
            1    0.000    0.000    0.000    0.000 threading.py:560(__init__)
            2    0.000    0.000    0.000    0.000 threading.py:569(isSet)
            4    0.000    0.000    0.000    0.000 threading.py:58(__init__)
            1    0.000    0.000    0.000    0.000 threading.py:602(wait)
            1    0.000    0.000    0.000    0.000 threading.py:627(_newname)
            5    0.000    0.000    0.000    0.000 threading.py:63(_note)
            1    0.000    0.000    0.000    0.000 threading.py:656(__init__)
            1    0.000    0.000    0.000    0.000 threading.py:709(_set_daemon)
            1    0.000    0.000    0.000    0.000 threading.py:726(start)
            1    0.000    0.000   10.010   10.010 threading.py:911(join)
           10   10.010    1.001   10.010    1.001 {method 'acquire' of 'thread.lock' objects}
            2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            4    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
            4    0.000    0.000    0.000    0.000 {thread.allocate_lock}
            2    0.000    0.000    0.000    0.000 {thread.get_ident}
            1    0.000    0.000    0.000    0.000 {thread.start_new_thread}
As you can see, it says that almost all of the time is spent acquiring locks. Of course, we know that's not an accurate picture of what the script was doing: all of the time was actually spent in the time.sleep call inside f(). The tottime of the acquire call is high only because join was waiting for f to finish, which means it had to sit and wait to acquire a lock. Meanwhile, cProfile shows no time being spent in f at all, because it only instruments the thread it was started in, and f ran in a different thread. We can see what is actually happening here because the example code is so simple, but in a more complicated program this output is very misleading.
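If you would rather stay with cProfile, one workaround is to start a separate profiler inside the worker thread itself. Here is a minimal sketch of that idea, not a drop-in solution. The profiled() wrapper and summarize() helper are names I made up for illustration, the sleep is shortened to 0.2 seconds, and I have only tried to cover the single-worker case (running several cProfile profilers at the same time may not be supported on every Python version):

```python
import cProfile
import pstats
import threading
import time


def f():
    time.sleep(0.2)  # shortened from the 10 s used in test.py


def profiled(target, results):
    """Hypothetical helper: run `target` under its own profiler in the worker thread."""
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            # runcall() enables the profiler, calls target, then disables it,
            # all inside the thread that executes wrapper().
            return profiler.runcall(target, *args, **kwargs)
        finally:
            results.append(profiler)
    return wrapper


def summarize(profiler):
    """Map function name -> cumulative seconds (names may collide in real code)."""
    stats = pstats.Stats(profiler)
    # Assumption: stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    return {func[2]: entry[3] for func, entry in stats.stats.items()}


def main():
    results = []
    t = threading.Thread(target=profiled(f, results))
    t.start()
    t.join()
    return summarize(results[0])


cumulative = main()
for name, seconds in sorted(cumulative.items()):
    print("%-60s %.3f" % (name, seconds))
```

Profiled this way, f shows up with a cumulative time close to the sleep duration, which is exactly the information that cProfile.run('test.main()') hides.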
You can get more reliable results by using another profiling library, like yappi:
    >>> import test
    >>> import yappi
    >>> yappi.set_clock_type("wall")
    >>> yappi.start()
    >>> test.main()
    >>> yappi.get_func_stats().print_all()

    Clock type: wall
    Ordered by: totaltime, desc

    name                                    #n         tsub      ttot      tavg
    <string>:1 <module>                     2/1        0.000025  10.00801  5.004003
    test.py:10 main                         1          0.000060  10.00798  10.00798
    ..2.7/threading.py:308 _Condition.wait  2          0.000188  10.00746  5.003731
    ..thon2.7/threading.py:911 Thread.join  1          0.000039  10.00706  10.00706
    ..ython2.7/threading.py:752 Thread.run  1          0.000024  10.00682  10.00682
    test.py:6 f                             1          0.000013  10.00680  10.00680
    ..hon2.7/threading.py:726 Thread.start  1          0.000045  0.000608  0.000608
    ..thon2.7/threading.py:602 _Event.wait  1          0.000029  0.000484  0.000484
    ..2.7/threading.py:656 Thread.__init__  1          0.000064  0.000250  0.000250
    ..on2.7/threading.py:866 Thread.__stop  1          0.000025  0.000121  0.000121
    ..lib/python2.7/threading.py:541 Event  1          0.000011  0.000101  0.000101
    ..python2.7/threading.py:241 Condition  2          0.000025  0.000094  0.000047
    ..hreading.py:399 _Condition.notifyAll  1          0.000020  0.000090  0.000090
    ..2.7/threading.py:560 _Event.__init__  1          0.000018  0.000090  0.000090
    ..thon2.7/encodings/utf_8.py:15 decode  2          0.000031  0.000071  0.000035
    ..threading.py:259 _Condition.__init__  2          0.000064  0.000069  0.000034
    ..7/threading.py:372 _Condition.notify  1          0.000034  0.000068  0.000068
    ..hreading.py:299 _Condition._is_owned  3          0.000017  0.000040  0.000013
    ../threading.py:709 Thread._set_daemon  1          0.000018  0.000035  0.000035
    ..ding.py:293 _Condition._release_save  2          0.000019  0.000033  0.000016
    ..thon2.7/threading.py:63 Thread._note  7          0.000020  0.000020  0.000003
    ..n2.7/threading.py:1152 currentThread  2          0.000015  0.000019  0.000009
    ..g.py:296 _Condition._acquire_restore  2          0.000011  0.000017  0.000008
    ../python2.7/threading.py:627 _newname  1          0.000014  0.000014  0.000014
    ..n2.7/threading.py:58 Thread.__init__  4          0.000013  0.000013  0.000003
    ..threading.py:1008 _MainThread.daemon  1          0.000004  0.000004  0.000004
    ..hon2.7/threading.py:569 _Event.isSet  2          0.000003  0.000003  0.000002
With yappi, it's much easier to see that the time is being spent in f.
I suspect that you'll find that in reality, most of your script's time is spent doing whatever work is being done in produceA, produceB, and produceC.
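If you want a quick sanity check of that without any profiler, you can time the worker functions directly with a wall clock. The following is only a sketch under assumptions: the timed() decorator is something I wrote for illustration, and the three produce functions here are stand-ins that just sleep, since I don't know what your real ones do:

```python
import functools
import threading
import time
from collections import defaultdict

_totals = defaultdict(float)   # function name -> accumulated wall-clock seconds
_lock = threading.Lock()       # protects _totals across threads


def timed(func):
    """Hypothetical decorator: add each call's wall-clock duration to _totals."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.time() - start
            with _lock:
                _totals[func.__name__] += elapsed
    return wrapper


# Stand-ins for the real worker functions; the sleeps are placeholders.
@timed
def produceA():
    time.sleep(0.05)


@timed
def produceB():
    time.sleep(0.10)


@timed
def produceC():
    time.sleep(0.02)


def run_all():
    threads = [threading.Thread(target=fn) for fn in (produceA, produceB, produceC)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(_totals)


totals = run_all()
for name, seconds in sorted(totals.items()):
    print("%s: %.3f s" % (name, seconds))
```

Because the time is measured inside each thread, the totals reflect the work itself and not the lock the main thread waits on in join.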