Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

我寻月下人不归 2020-11-22 14:31

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done pa

6 answers
  •  盖世英雄少女心
    2020-11-22 15:15

    Oh my. This is actually so simple!

    # group rows that share an index value, then keep the last row of each group
    grouped = df3.groupby(level=0)
    df4 = grouped.last()
    df4
                          A   B  rownum
    
    2001-01-01 00:00:00   0   0       6
    2001-01-01 01:00:00   1   1       7
    2001-01-01 02:00:00   2   2       8
    2001-01-01 03:00:00   3   3       3
    2001-01-01 04:00:00   4   4       4
    2001-01-01 05:00:00   5   5       5
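
    The snippet above relies on `df3` from the question, which is not shown here. As a minimal sketch, the block below rebuilds a frame consistent with the printed output (the construction of `df3` is an assumption, and `df5` is a name introduced only for illustration) and compares the groupby approach with a boolean mask built from `Index.duplicated`:

```python
import pandas as pd

# Assumed reconstruction of df3 (the question's setup is not shown here):
# six hourly observations followed by a repeat of the first three hours.
hours = [0, 1, 2, 3, 4, 5, 0, 1, 2]
index = pd.DatetimeIndex([pd.Timestamp(2001, 1, 1, h) for h in hours])
df3 = pd.DataFrame(
    {"A": hours, "B": hours, "rownum": list(range(9))},
    index=index,
)

# Approach from the answer: group on the index and keep the last row per label.
df4 = df3.groupby(level=0).last()

# Alternative: a boolean mask that drops every occurrence except the last one.
df5 = df3[~df3.index.duplicated(keep="last")]

print(df4)
print(df5.sort_index())
```

    The mask keeps the surviving rows in their original order, whereas `groupby` sorts by index, so `df5.sort_index()` is needed to line the two results up. Also note that `GroupBy.last()` takes the last non-null value per column, so the two approaches can differ when the duplicated rows contain NaN.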
    

    Follow-up edit (2013-10-29): In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

    import numpy as np
    import pandas
    
    # fake index
    idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])
    
    # random data + naming the index levels
    df1 = pandas.DataFrame(np.random.normal(size=(5,2)), index=idx, columns=['colA', 'colB'])
    df1.index.names = ['iA', 'iB']
    
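    # artificially append some duplicate data
    # (DataFrame.select and DataFrame.append were removed in newer pandas, so use a mask and concat)
    dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
    df1 = pandas.concat([df1, dups])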
    df1
    #           colA      colB
    #iA iB                    
    #a  a  -1.297535  0.691787
    #   b  -1.688411  0.404430
    #   c   0.275806 -0.078871
    #   d  -0.509815 -0.220326
    #   e  -0.066680  0.607233
    #   c   0.275806 -0.078871  # <--- dup 1
    #   e  -0.066680  0.607233  # <--- dup 2
    

    And here's the important part:

    # group the data, using df1.index.names tells pandas to look at the entire index
    groups = df1.groupby(level=df1.index.names)  
    groups.last() # or .first()
    #           colA      colB
    #iA iB                    
    #a  a  -1.297535  0.691787
    #   b  -1.688411  0.404430
    #   c   0.275806 -0.078871
    #   d  -0.509815 -0.220326
    #   e  -0.066680  0.607233
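
    As a quick check of the MultiIndex case, here is a seeded sketch of the same idea. The names `df_dup`, `deduped` and `masked` are hypothetical, and the `Index.duplicated` mask is shown only as an alternative to the groupby approach, not as part of it:

```python
import numpy as np
import pandas as pd

# Same shape as the example above, with a seeded generator so runs are repeatable.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_tuples(
    [("a", letter) for letter in "abcde"], names=["iA", "iB"]
)
df1 = pd.DataFrame(rng.normal(size=(5, 2)), index=idx, columns=["colA", "colB"])

# Re-append the 'c' and 'e' rows to create duplicate index entries.
dups = df1[df1.index.get_level_values("iB").isin(["c", "e"])]
df_dup = pd.concat([df1, dups])

# Group on every index level and keep one row per unique (iA, iB) pair.
deduped = df_dup.groupby(level=list(df_dup.index.names)).last()

# Alternative: mask out all but the last occurrence of each index tuple.
masked = df_dup[~df_dup.index.duplicated(keep="last")]

print(deduped)
print(masked.sort_index())
```

    Both results contain the five original rows; the groupby result comes back sorted by index, while the mask preserves the order in which the surviving rows appeared.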
    
