I have a big file (about 19 GB) that I want to load in memory to perform an aggregation over some columns.
The file looks like this:
id, col1, col2,
Dask.dataframe can almost do this without modification
$ cat so.csv
id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213
$ pip install dask[dataframe]
$ ipython
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('so.csv', sep=',')
In [3]: df.head()
Out[3]:
   id  col1  col2  col3
0   1    13    15    14
1   1    13    15    14
2   1    12    15    13
3   2    18    15    13
4   2    18    15    13
In [4]: df.groupby(['id', 'col1']).sum().compute()
Out[4]:
         col2  col3
id col1
1  12      15    13
   13      30    28
2  18      90    78
3  14     215   239
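On the real 19 GB file those same two lines work unchanged, because Dask reads the CSV in blocks rather than all at once. If you want to tune how large each block is, read_csv takes an optional blocksize argument; the 64 MB value below is just an illustration, not something the example above needs:

import dask.dataframe as dd

# same read as above, but with an explicit partition size (~64 MB per block)
df = dd.read_csv('so.csv', sep=',', blocksize=64 * 1024 * 1024)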
No one has implemented the as_index=False keyword for Dask's groupby yet, though. We can work around this with assign.
In [5]: df.assign(id_2=df.id, col1_2=df.col1).groupby(['id_2', 'col1_2']).sum().compute()
Out[5]:
             id  col1  col2  col3
id_2 col1_2
1    12       1    12    15    13
     13       2    26    30    28
2    18      12   108    90    78
3    14       9    42   215   239
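If all you want is id and col1 back as ordinary columns, another option is to reset the index on the computed result; this is plain pandas applied after compute(), not a Dask-specific feature, and is sketched here only as an alternative to the assign trick:

# df is the dask dataframe from dd.read_csv above; compute() returns a
# pandas DataFrame, so reset_index() turns the (id, col1) MultiIndex
# back into regular columns
result = df.groupby(['id', 'col1']).sum().compute().reset_index()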
We'll pull out chunks and do groupbys on them, just like in your first example. Once we're done grouping and summing each of the chunks, we'll gather all of the intermediate results together and do another, slightly different groupby.sum. This assumes that the intermediate results will fit in memory.
As a pleasant side effect, this will also operate in parallel.
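For reference, here is a minimal sketch of that split-apply-combine pattern written in plain pandas; the chunk size is an arbitrary assumption and the column names come from the example file above:

import pandas as pd

# group and sum each chunk independently
partials = []
for chunk in pd.read_csv('so.csv', chunksize=1_000_000):
    partials.append(chunk.groupby(['id', 'col1']).sum())

# combine step: a second, slightly different groupby that sums the
# per-chunk partial results by the same keys
result = pd.concat(partials).groupby(level=['id', 'col1']).sum()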