I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4 MB on my computer.
This is what I am writing in Jupyter:
Try it like this: 1) load with Dask, then 2) convert to pandas.
import time
import dask.dataframe as dd

# time.clock() was removed in Python 3.12; use perf_counter() instead
t = time.perf_counter()
df = dd.read_csv('./movielens/ratings.csv')  # lazy: builds a task graph, reads nothing yet
df = df.compute()  # execute the graph and collect the result as a pandas DataFrame
print("load ratings:", time.perf_counter() - t)
You should consider using the chunksize parameter of read_csv: with it, read_csv returns a TextFileReader object (an iterator over DataFrame chunks) that you can pass straight to pd.concat to reassemble the full DataFrame.
import pandas as pd

chunksize = 100000
# chunksize alone is enough; it already implies iterator behaviour
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize)
df = pd.concat(tfr, ignore_index=True)  # stitch the chunks back into one DataFrame
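If the concatenated DataFrame is still too big, passing explicit dtypes to read_csv shrinks it, since pandas defaults to 64-bit types. A sketch assuming the standard ratings.csv header (userId, movieId, rating, timestamp); the narrower types are my suggestion, not a requirement:

import pandas as pd

dtypes = {'userId': 'int32', 'movieId': 'int32',
          'rating': 'float32', 'timestamp': 'int64'}
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=100000, dtype=dtypes)
df = pd.concat(tfr, ignore_index=True)  # smaller footprint than the default 64-bit dtypes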
If you just want to process each chunk individually, use:
chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    do_something_with_chunk(chunk)  # placeholder for your per-chunk logic
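For example, here do_something_with_chunk is replaced by a hypothetical running-mean update over the rating column, so the overall mean is computed without ever holding the whole file in memory:

import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=20000):
    total += chunk['rating'].sum()  # per-chunk partial sum
    count += len(chunk)
print('mean rating:', total / count)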