Question
I have a list dataframe_chunk which contains chunks of a very large pandas dataframe. I would like to write each chunk to a different CSV file, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why. Here's the code:
import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)
futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))
for f in cfu.as_completed(futures):
    print("finished at ", time.time())
Any clues?
Answer 1:
One thing that is stated in the Python 2.7.x threading docs but not in the 3.x docs is that Python cannot achieve true parallelism using the threading library: because of the Global Interpreter Lock, only one thread executes Python bytecode at a time.
You should try using concurrent.futures with the ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
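For reference, here is a minimal sketch of the question's code moved onto ProcessPoolExecutor (dataframe_chunk and N_CORES are assumed to be defined as in the question; note that submit() expects the callable and its arguments as separate parameters, so calling the function inside submit() would run it immediately in the parent process):

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # dataframe_chunk and N_CORES are assumed to exist as in the question.
    with cfu.ProcessPoolExecutor(N_CORES) as pool:
        futures = []
        for i in range(N_CORES):
            fpath = '/path_to_files_' + str(i) + '.csv'
            # Pass the callable and its arguments separately; writing
            # write_chunk_to_file(...) inside submit() would call the
            # function eagerly in the parent process.
            futures.append(pool.submit(write_chunk_to_file,
                                       dataframe_chunk[i], fpath))
        for f in cfu.as_completed(futures):
            print("finished at", time.time())

The if __name__ == '__main__' guard matters here: process pools re-import the main module on platforms that spawn workers, and the guard prevents the pool from being created recursively.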
Update
Here is your program adapted to use the multiprocessing
library instead:
#!/usr/bin/env python3
from multiprocessing import Process
import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    # Stand-in workload: writes ten million integers to the file
    # instead of a real dataframe chunk (here `chunk` receives the
    # loop index and is unused).
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("Hit return to start:")

start = time.time()
print("Started at:", start)

for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

# Start all processes first, then wait for each one to finish.
for p in futures:
    p.start()
print("All jobs started.")

for p in futures:
    p.join()
print("All jobs finished at", time.time())
You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done
(Replace 12345 with the pid emitted by the script.)
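To apply the same pattern to the actual pandas chunks rather than the dummy integer workload, a sketch using multiprocessing.Pool (dataframe_chunk is assumed to be the list from the question; each chunk is pickled and sent to its worker process, which adds some overhead for very large frames):

#!/usr/bin/env python3
from multiprocessing import Pool

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    # Each worker process writes its chunk to its own file.
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # dataframe_chunk is assumed to be the list of DataFrames
    # from the question.
    jobs = [(dataframe_chunk[i], './tmp/file-' + str(i) + '.csv')
            for i in range(N_CORES)]
    with Pool(N_CORES) as pool:
        pool.starmap(write_chunk_to_file, jobs)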
Source: https://stackoverflow.com/questions/38456458/concurrent-futures-not-parallelizing-write