Using pandas to efficiently read in a large CSV file without crashing

2020-12-17 05:55

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/; the file is 533.4 MB on my computer.

This is what I am writing in Jupyter Notebook:

2 Answers
  • 2020-12-17 06:02

    Try it like this: 1) load the file with Dask, then 2) convert it to pandas.

    import time

    import dask.dataframe as dd
    import pandas as pd

    t = time.perf_counter()  # time.clock() was removed in Python 3.8
    # Dask parses the CSV in parallel blocks; compute() materializes one pandas DataFrame
    df_train = dd.read_csv('../data/train.csv')
    df_train = df_train.compute()
    print("load train:", time.perf_counter() - t)
    
  • 2020-12-17 06:11

    You should consider using the chunksize parameter of read_csv when reading in your DataFrame: with it, read_csv returns a TextFileReader iterator, which you can then pass to pd.concat to concatenate the chunks.

    import pandas as pd

    chunksize = 100000
    # With chunksize set, read_csv returns a TextFileReader that yields DataFrames
    tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
    df = pd.concat(tfr, ignore_index=True)
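
    Note that concatenating every chunk still leaves the whole file in memory at the end, so this mainly smooths out the parsing step rather than lowering the final footprint. If memory is the real constraint, a hedged variant is to also pass usecols and dtype so each chunk is smaller; the sketch below assumes the MovieLens ratings.csv header (userId, movieId, rating, timestamp), and the narrow dtypes are an assumption about acceptable precision.

    import pandas as pd

    # Assumed column names from the MovieLens ratings.csv header;
    # int32/float32 are assumptions about acceptable precision.
    dtypes = {'userId': 'int32', 'movieId': 'int32',
              'rating': 'float32', 'timestamp': 'int64'}

    reader = pd.read_csv('./movielens/ratings.csv',
                         usecols=list(dtypes), dtype=dtypes,
                         chunksize=100000)
    df = pd.concat(reader, ignore_index=True)
    df.info(memory_usage='deep')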
    

    If you just want to process each chunk individually, use:

    import pandas as pd

    chunksize = 20000
    for chunk in pd.read_csv('./movielens/ratings.csv',
                             chunksize=chunksize,
                             iterator=True):
        do_something_with_chunk(chunk)  # placeholder for your per-chunk logic
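
    Here do_something_with_chunk is a placeholder. As a hypothetical example (the 'movieId' column name assumes the MovieLens schema), per-chunk processing lets you aggregate results without ever holding the full file in memory:

    import pandas as pd

    # Hypothetical per-chunk aggregation: count ratings per movie while keeping
    # only one chunk in memory at a time. 'movieId' assumes the MovieLens header.
    counts = pd.Series(dtype='int64')
    for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=20000):
        counts = counts.add(chunk['movieId'].value_counts(), fill_value=0)

    print(counts.sort_values(ascending=False).head())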
    