How should I write multiple CSV files efficiently using dask.dataframe?


Question


Here is the summary of what I'm doing:

At first, I do this by normal multiprocessing and pandas package:

Step 1. Get the list of file names to read

import os    
files = os.listdir(DATA_PATH + product)

Step 2. Loop over the list

from multiprocessing import Pool
import pandas as pd    

def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe 
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False)

    ### Step 2.2 do some calculation
    ### .......

    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv("another folder/"+file)

if __name__ == '__main__':
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()  

The code works fine, but it's very slow: it takes about 1,000 seconds to finish the task.

By comparison, an equivalent R program using library(parallel) and the parSapply function takes only about 160 seconds.

So I then tried dask.delayed and dask.dataframe with the following code:

Step 1. Get the list of file names to read

import os    
files = os.listdir(DATA_PATH + product)

Step 2. Loop over the list

from dask.delayed import delayed
import dask.dataframe as dd
from dask import compute

def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe 
    data = dd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False, assume_missing=True)

    ### Step 2.2 do some calculation
    ### .......

    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv(filename="another folder/*", name_function=lambda x: file)

compute([delayed(readAndWriteCsvFiles)(file) for file in files])

This time, I found that if I commented out step 2.3 in both the dask code and the pandas code, dask would run much faster than plain pandas with multiprocessing.

But as soon as I invoke the to_csv method, dask becomes as slow as pandas.

Any solution?

Thanks


Answer 1:


Reading and writing CSV files is often bound by the GIL. You might want to try parallelizing with processes rather than with threads (the default for dask delayed).

You can achieve this by adding the scheduler='processes' keyword to your compute call.

compute([delayed(readAndWriteCsvFiles)(file) for file in files], scheduler='processes')
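
Putting that together, here is a minimal sketch of the suggestion. Note it also swaps dd.read_csv for plain pd.read_csv inside the task, since each task handles a single file anyway; DATA_PATH and product are assumed to be defined as in the question.

import os
import pandas as pd
from dask import compute, delayed

def readAndWriteCsvFiles(file):
    # Plain pandas inside the task: under scheduler='processes' each
    # delayed call runs in its own process, so the GIL is not a bottleneck.
    data = pd.read_csv(DATA_PATH + product + "/" + file,
                       parse_dates=True, infer_datetime_format=False)

    # ... your calculation here ...

    data.to_csv("another folder/" + file)

if __name__ == '__main__':
    files = os.listdir(DATA_PATH + product)
    compute([delayed(readAndWriteCsvFiles)(f) for f in files],
            scheduler='processes')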

See the scheduling documentation for more information.

Also, note that you're not using dask.dataframe here, but rather dask.delayed.
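
If the per-file calculation can be expressed as dataframe operations, a rough sketch of what the pure dask.dataframe route might look like follows; the *.csv glob and the output naming scheme are illustrative assumptions, not from the question.

import dask.dataframe as dd

# Read every CSV in the folder into one partitioned dask dataframe,
# roughly one partition per input file.
ddf = dd.read_csv(DATA_PATH + product + "/*.csv", assume_missing=True)

# ... dataframe-level calculation here ...

# Write one CSV per partition; name_function turns the partition
# index into a file name.
ddf.to_csv("another folder/part-*.csv", name_function=lambda i: str(i))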



Source: https://stackoverflow.com/questions/52342245/how-should-i-write-multiple-csv-files-efficiently-using-dask-dataframe
