Cumulative operations on object dtype columns

清歌不尽 2020-12-19 09:02

I am trying to figure out how I can apply cumulative functions to object-dtype columns. For numbers there are several alternatives, like cumsum and cumcount.

3 Answers
  • 2020-12-19 09:07

    I think you can use cumsum; only for the set column C2 you first need to convert to list and afterwards back to set. By the way, storing sets (C2) or lists (C4) in DataFrame columns is not recommended.
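
    For reference, a minimal sketch of the sample frame, with values assumed from the output below:

    import pandas as pd

    df = pd.DataFrame({'C1': [1, 2, 3, 4],
                       'C2': [{'A'}, {'B'}, {'C'}, {'D'}],
                       'C3': ['A', 'B', 'C', 'D'],
                       'C4': [['A'], ['B'], ['C'], ['D']]})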

    print(df)
       C1   C2 C3   C4
    0   1  {A}  A  [A]
    1   2  {B}  B  [B]
    2   3  {C}  C  [C]
    3   4  {D}  D  [D]
    
    print(df[['C1','C3','C4']].cumsum())
       C1    C3            C4
    0   1     A           [A]
    1   3    AB        [A, B]
    2   6   ABC     [A, B, C]
    3  10  ABCD  [A, B, C, D]
    
    # sets don't support +, so convert to list, concatenate via cumsum,
    # then convert back to set
    df['C2'] = df['C2'].apply(list)
    df = df.cumsum()
    df['C2'] = df['C2'].apply(set)
    print(df)
       C1            C2    C3            C4
    0   1           {A}     A           [A]
    1   3        {A, B}    AB        [A, B]
    2   6     {A, C, B}   ABC     [A, B, C]
    3  10  {A, C, B, D}  ABCD  [A, B, C, D]
    
  • 2020-12-19 09:13

    Turns out this cannot be done.

    Continuing with the same sample:

    def burndowntheworld(ser):
        print('Are you sure?')
        return ser/0
    
    df.select_dtypes(['object']).expanding().apply(burndowntheworld)
    Out: 
        C2 C3   C4
    0  {A}  A  [A]
    1  {B}  B  [B]
    2  {C}  C  [C]
    3  {D}  D  [D]
    

    If the column's dtype is object, the function is never called, and pandas doesn't offer an alternative that works on objects. The same holds for rolling().apply().

    In some sense this is a good thing, because expanding().apply() with a custom function has O(n**2) complexity. For special cases like cumsum, ewma, etc., the recursive nature of the operation reduces this to linear time, but in the general case the function has to be evaluated on the first n elements, then on the first n+1 elements, and so on. So especially for a function that depends only on the current value and the function's own previous result, expanding is quite inefficient. Not to mention that storing lists or sets in a DataFrame is never a good idea to begin with.

    So the answer is: if your data is not numeric and the function depends on the previous result and the current element, just use a for loop, as sketched below. It will be more efficient anyway.
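
    For example, a cumulative set union in a single linear pass (a minimal sketch, assuming the same C2 column as in the first answer):

    import pandas as pd

    s = pd.Series([{'A'}, {'B'}, {'C'}, {'D'}], name='C2')

    result = []
    acc = set()
    for value in s:
        acc = acc | value   # build a new set each step so rows don't share state
        result.append(acc)

    print(pd.Series(result, index=s.index, name='C2'))
    # 0             {A}
    # 1          {A, B}
    # 2       {A, B, C}
    # 3    {A, B, C, D}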

  • 2020-12-19 09:25

    Well, you can define a custom function:

    import pandas as pd
    from functools import reduce

    def custom_cumsum(df):
        nrows = len(df)
        rets = {}
        for col in df.columns:
            try:
                # numbers, strings and lists all support +, so cumsum works
                rets[col] = df[col].cumsum()
            except TypeError as e:
                if 'set' in str(e):
                    # sets don't support +, so build the running union manually
                    rets[col] = [reduce(set.union, df[col][:i + 1])
                                 for i in range(nrows)]
                else:
                    raise
        return pd.DataFrame(rets, index=df.index, columns=df.columns)
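
    Called on the sample frame from the first answer, it should reproduce the cumulative columns (a sketch; set element order in the output may vary):

    print(custom_cumsum(df))
    #    C1            C2    C3            C4
    # 0   1           {A}     A           [A]
    # 1   3        {A, B}    AB        [A, B]
    # 2   6     {A, C, B}   ABC     [A, B, C]
    # 3  10  {A, C, B, D}  ABCD  [A, B, C, D]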
    