pandas groupby with sum() on large csv file?

Asked by 别那么骄傲 on 2020-12-10 06:16

I have a big file (19 GB or so) that I want to load into memory to perform an aggregation over some columns.

The file looks like this:

id, col1, col2,


        
2 Answers

  •  感情败类, answered 2020-12-10 06:39

    dask solution

    Dask.dataframe can do this almost without modification:

    $ cat so.csv
    id,col1,col2,col3
    1,13,15,14
    1,13,15,14
    1,12,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    3,14,15,13
    3,14,15,13
    3,14,185,213
    
    $ pip install dask[dataframe]
    $ ipython
    
    In [1]: import dask.dataframe as dd
    
    In [2]: df = dd.read_csv('so.csv', sep=',')
    
    In [3]: df.head()
    Out[3]: 
       id  col1  col2  col3
    0   1    13    15    14
    1   1    13    15    14
    2   1    12    15    13
    3   2    18    15    13
    4   2    18    15    13
    
    In [4]: df.groupby(['id', 'col1']).sum().compute()
    Out[4]: 
             col2  col3
    id col1            
    1  12      15    13
       13      30    28
    2  18      90    78
    3  14     215   239
    

    No one has implemented as_index=False for groupby in dask yet, but we can work around this with assign:

    In [5]: df.assign(id_2=df.id, col1_2=df.col1).groupby(['id_2', 'col1_2']).sum().compute()
    Out[5]: 
                 id  col1  col2  col3
    id_2 col1_2                      
    1    12       1    12    15    13
         13       2    26    30    28
    2    18      12   108    90    78
    3    14       9    42   215   239
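
    If your dask version supports it (an assumption on my part; the behaviour of reset_index on grouped results has varied across releases), you can often skip the assign trick and simply reset the index after the aggregation:

    In [6]: df.groupby(['id', 'col1']).sum().reset_index().compute()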
    

    How this works

    We'll pull out chunks and do groupbys on each of them, just like in your first example. Once we're done grouping and summing each chunk, we gather all of the intermediate results together and do another, slightly different groupby.sum. This assumes that the intermediate results fit in memory.
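
    A minimal sketch of that split/recombine pattern in plain pandas, using the same so.csv as above (the chunksize here is only illustrative; for a 19 GB file you would pick something much larger, e.g. a few million rows):

    import pandas as pd

    partials = []
    # Group and sum each chunk independently; each intermediate result is small.
    for chunk in pd.read_csv('so.csv', chunksize=2):
        partials.append(chunk.groupby(['id', 'col1']).sum())

    # Concatenate the per-chunk results and group/sum once more to merge
    # groups that were split across chunk boundaries.
    result = pd.concat(partials).groupby(level=['id', 'col1']).sum()
    print(result)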

    Parallelism

    As a pleasant side effect, this will also operate in parallel.
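
    By default dask.dataframe runs these per-partition groupbys on a thread pool. A minimal sketch of switching schedulers, assuming a dask version recent enough to accept the scheduler= keyword in compute (older releases spelled this differently):

    In [7]: # Use a process pool instead of the default threaded scheduler
       ...: df.groupby(['id', 'col1']).sum().compute(scheduler='processes')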
