Splitting pandas dataframe into many chunks

前端 未结 2 1454
天命终不由人
天命终不由人 2021-01-07 05:18

Let\'s say I have a dataframe with the following structure:

    observation
d1  1
d2  1
d3  -1
d4  -1
d5  -1
d6  -1
d7  1
d8  1
d9  1
d10 1
d11 -1
d12 -1
d13         


        
相关标签:
2条回答
  • 2021-01-07 05:34

    You can create a group variable based on the cumsum() of the diff() of the observation column where if the diff() is not equal to zero, assign a True value, thus every time a new value appears, a new group id will be created with the cumsum(), and then you can either apply standard analysis after groupby() with df.groupby((df.observation.diff() != 0).cumsum())...(other chained analysis here) or split them into smaller data frames with list-comprehension:

    lst = [g for _, g in df.groupby((df.observation.diff() != 0).cumsum())]
    
    lst[0]
    # observation
    #d1         1
    #d2         1
    
    lst[1]
    # observation
    #d3        -1
    #d4        -1
    #d5        -1
    #d6        -1
    ...
    

    Index chunks here:

    [i.index for i in lst]
    
    #[Index(['d1', 'd2'], dtype='object'),
    # Index(['d3', 'd4', 'd5', 'd6'], dtype='object'),
    # Index(['d7', 'd8', 'd9', 'd10'], dtype='object'),
    # Index(['d11', 'd12', 'd13', 'd14', 'd15'], dtype='object'),
    # Index(['d16', 'd17', 'd18', 'd19', 'd20'], dtype='object')]
    
    0 讨论(0)
  • 2021-01-07 05:44

    Here's an example using real date.datetime objects as indices.

    import pandas as pd
    import numpy as np
    import datetime
    import random
    
    df = pd.DataFrame({'x': np.random.randn(40)}, index = [date.fromordinal(random.randint(start_date, end_date)) for i in range(40)])
    
    def filter_on_datetime(df, year = None, month = None, day = None):
        if all(d is not None for d in {year, month, day}):
            idxs = [idx for idx in df.index if idx.year == year and idx.month == month and idx.day == day]
        elif year is not None and month is not None and day is None:
            idxs = [idx for idx in df.index if idx.year == year and idx.month == month]
        elif year is not None and month is None and day is None:
            idxs = [idx for idx in df.index if idx.year == year]
        elif year is None and month is not None and day is not None:
            idxs = [idx for idx in df.index if idx.month == month and idx.day == day]
        elif year is None and month is None and day is not None:
            idxs = [idx for idx in df.index if idx.day == day]
        elif year is None and month is not None and day is None:
            idxs = [idx for idx in df.index if idx.month == month]
        elif year is not None and month is None and day is not None:
            idxs = [idx for idx in df.index if idx.year == year and idx.day == day] 
        else:
            idxs = df.index
        return df.ix[idxs]
    

    Running this:

    >>> print(filter_on_datetime(df = df, year = 2016, month = 2))
                       x
    2016-02-01 -0.141557
    2016-02-03  0.162429
    2016-02-05  0.703794
    2016-02-07 -0.184492
    2016-02-09 -0.921793
    2016-02-12  1.593838
    2016-02-17  2.784899
    2016-02-19  0.034721
    2016-02-26 -0.142299
    
    0 讨论(0)
提交回复
热议问题