Question
I'm working with big CSV files and I need to compute a Cartesian product (merge operation). I've tried to tackle the problem with pandas (you can check the pandas code and a data format example for the same problem here) without success, due to memory errors. Now I'm trying with Dask, which is supposed to manage huge datasets even when their size is bigger than the available RAM.
First of all, I read both CSVs:
from dask import dataframe as dd
BLOCKSIZE = 64000000  # = 64 MB chunks
df1_file_path = './mRNA_TCGA_breast.csv'
df2_file_path = './miRNA_TCGA_breast.csv'
# Gets Dataframes
df1 = dd.read_csv(
df1_file_path,
delimiter='\t',
blocksize=BLOCKSIZE
)
first_column = df1.columns.values[0]
df1 = df1.set_index(first_column)  # set_index is not in-place, so assign the result back
df2 = dd.read_csv(
df2_file_path,
delimiter='\t',
blocksize=BLOCKSIZE
)
first_column = df2.columns.values[0]
df2 = df2.set_index(first_column)  # same here: assign the result back
# Filter common columns
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns]
df2 = df2[common_columns]
Then, I perform the operation, storing the result on disk to prevent memory errors:
# Computes a Cartesian product
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
# Neither of these two options works
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_hdf('/tmp/merge.*.hdf', key='/merge_data')
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/')
I've made a repo to try with exactly the same CSV files that I'm using. I've tried smaller blocksize values, but I got the same error. Am I missing something? Any kind of help would be really appreciated.
Answer 1:
I successfully ran your code using the following method, with memory limited to 32 GB.
I got rid of the blocksize argument and used repartition on df1 and df2 instead.
df1 = df1.repartition(npartitions=50)
df2 = df2.repartition(npartitions=1)
Notice that df2 is much smaller than df1 (2.5 MB vs. 23.75 MB), which is why I kept only one partition for df2 and split df1 into 50 partitions.
Doing so should make the code work for you; in my case, memory usage stayed below 12 GB.
To check, I computed the len of the result:
len(df) # 3001995
Following the steps above creates a Parquet file with 50 partitions. You can use repartition again to get the partition_size you want.
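Putting everything together, a minimal end-to-end sketch might look like the following. The file paths and the 50/1 partition counts come from the question and this answer; the output directory and the final repartition step are illustrative assumptions, not values from the original post.

from dask import dataframe as dd

# Read both CSVs without a blocksize argument
df1 = dd.read_csv('./mRNA_TCGA_breast.csv', delimiter='\t')
df2 = dd.read_csv('./miRNA_TCGA_breast.csv', delimiter='\t')

# Keep only the columns present in both files
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns]
df2 = df2[common_columns]

# Repartition: 50 partitions for the big frame, 1 for the small one
df1 = df1.repartition(npartitions=50)
df2 = df2.repartition(npartitions=1)

# Cartesian product via a constant join key, written straight to disk
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/merge_output/')  # output path is an assumption

# Optionally repartition the result afterwards
# df = dd.read_parquet('/tmp/merge_output/').repartition(npartitions=10)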
NB:
Adding this should speed up your code:
from dask.distributed import Client
client = Client()
In my case, I had to use the argument Client(processes=False) because of my running environment.
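For reference, that variant is just the following (a minimal sketch; whether you need processes=False depends on your environment):

from dask.distributed import Client

# Threads-only local cluster: avoids spawning worker processes,
# which some environments do not allow
client = Client(processes=False)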
Source: https://stackoverflow.com/questions/59865572/dask-running-out-of-memory-even-with-chunks