Using pandas to efficiently read in a large CSV file without crashing

半世苍凉 提交于 2019-12-30 10:44:26

问题


I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/ the file is 533.4MB in my computer.

This is what am writing in jupyter notebook

import pandas as pd
ratings = pd.read_cv('./movielens/ratings.csv', sep=',')

The problem from here is the kernel would break or die and ask me to restart and its keeps repeating the same. There is no any error. Please can you suggest any alternative of solving this, it is as if my computer has no capability of running this.

This works but it keeps rewriting

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
ratings.append(ratings)
ratings.head()

Only the last chunk is written others are written-off


回答1:


You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)

If you just want to process each chunk individually, use,

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', 
                         chunksize=chunksize, 
                         iterator=True):
    do_something_with_chunk(chunk)



回答2:


try like this - 1) load with dask and then 2) convert to pandas

import pandas as pd
import dask.dataframe as dd
import time
t=time.clock()
df_train = dd.read_csv('../data/train.csv')
df_train=df_train.compute()
print("load train: " , time.clock()-t)


来源:https://stackoverflow.com/questions/45870220/using-pandas-to-efficiently-read-in-a-large-csv-file-without-crashing

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!