How to import a gzip file larger than RAM into a Pandas DataFrame? Process gets killed (signal 9). Use HDF5?

感情败类 2020-12-18 11:58

I have a gzip file which is approximately 90 GB. This fits comfortably on disk, but is far larger than RAM.

How can I import this into a pandas DataFrame? Reading it in one go gets the process killed (signal 9).

1 Answer
  • 2020-12-18 12:05

    I'd do it this way:

    import pandas as pd

    filename = 'filename.gzip'      # gzipped delimited text file, ~90 GB
    hdf_fn = 'result.h5'            # output HDF5 store
    hdf_key = 'my_huge_df'
    cols = ['colA','colB','colC','colZ']   # put here a list of all your columns
    cols_to_index = ['colA','colZ']        # put here the list of YOUR columns that you want to index
    chunksize = 10**6               # rows per chunk - you may want to adjust it ...
    
    store = pd.HDFStore(hdf_fn)
    
    # read the gzipped file chunk by chunk, so it never has to fit in RAM
    for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
        # don't index data columns in each iteration - we'll do it later
        store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
    
    # index data columns in the HDFStore once, after all chunks have been appended
    store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
    store.close()
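
    Once the store is built, the indexed columns let you pull just the rows you need instead of loading 90 GB back into memory. A minimal sketch, assuming the same hdf_fn, hdf_key and column names as above ('some_value' is only a placeholder condition):

    import pandas as pd

    hdf_fn = 'result.h5'
    hdf_key = 'my_huge_df'

    # select only the rows matching a condition on an indexed data column
    subset = pd.read_hdf(hdf_fn, hdf_key, where="colA == 'some_value'")

    # or iterate over the whole store in manageable chunks
    for chunk in pd.read_hdf(hdf_fn, hdf_key, chunksize=10**6):
        pass  # process each chunk here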
    