Computing diffs within groups of a dataframe

你的背包 2020-11-30 19:21

Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tupl

6 Answers
  •  难免孤独
    2020-11-30 19:44

    OK. After a lot of thought and a bit of experimenting, this is my favorite combination of the solutions above. The original data lives in df:

    # df.sort() was removed in pandas 0.20; sort_values() is the replacement
    df.sort_values(['ticker', 'date'], inplace=True)
    
    # for this example, with diff, I think this syntax is a bit clunky
    # but for more general examples, this should be good.  But can we do better?
    df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff()) 
    
    df.sort_index(inplace=True)
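
    As a sanity check, here is a minimal runnable sketch of the same three steps on toy data. The sample rows are hypothetical, and the lowercase column names simply follow the snippet above; nothing here comes from the original dataset.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order to mimic unsorted input.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"]
    ),
    "ticker": ["AAA", "AAA", "BBB", "BBB", "AAA"],
    "value": [11.0, 10.0, 50.0, 47.0, 15.0],
})

# 1. Sort so that each ticker's rows are in date order.
df = df.sort_values(["ticker", "date"])

# 2. Difference within each ticker; the first row of each group becomes NaN.
df["diffs"] = df.groupby("ticker")["value"].transform(lambda x: x.diff())

# 3. Restore the original row order.
df = df.sort_index()

print(df)
```

    The first date of each ticker gets NaN rather than a difference against another ticker's last value, which is the point of grouping before calling diff.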
    

    This accomplishes everything I want. What I really like is that it generalizes to cases where you want to apply a function more intricate than diff. In particular, you could use something like lambda x: x.rolling(20, min_periods=20).mean() (written as pd.rolling_mean(x, 20, 20) in older pandas versions, where that function still exists) to make a column of rolling means without worrying about one ticker's data being contaminated by another ticker's, because groupby keeps the groups separate.
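
    A small sketch of the rolling-mean case, using the Series.rolling API from current pandas. The data is made up, and the window of 3 is an assumption chosen only so the toy groups are long enough to produce values; the text above uses 20.

```python
import pandas as pd

# Hypothetical data: two tickers, four dates each, already sorted by ticker and date.
df = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "date": list(pd.date_range("2020-01-01", periods=4)) * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

window = 3  # assumed window size for the toy data

# The rolling window restarts for each ticker, so no window spans two tickers.
df["rolling_mean"] = df.groupby("ticker")["value"].transform(
    lambda x: x.rolling(window, min_periods=window).mean()
)

print(df)
```

    The first window - 1 rows of each ticker stay NaN, which shows that BBB's first windows are not filled from the tail of AAA.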

    So here's the question I'm left with: why doesn't the following work in place of the line that starts with df['diffs']?

    df['diffs'] = df.groupby('ticker')['value'].transform(np.diff)
    

    When I do that, I get a diffs column full of 0s. Any thoughts on that?
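
    One likely factor, offered as a sketch rather than a confirmed diagnosis of the zeros: np.diff returns one element fewer than its input, while transform expects a result the same length as each group. What pandas does with the shorter result may depend on the version, so the example below only demonstrates the length mismatch and one way to pad it. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 11.0, 15.0])

raw = np.diff(values.to_numpy())   # n - 1 elements: no slot for the first row
aligned = values.diff()            # n elements: NaN in the first slot

print(len(values), len(raw), len(aligned))

# Padding with a leading NaN (the prepend argument, NumPy 1.16+) restores the length.
padded = np.diff(values.to_numpy(), prepend=np.nan)

# With the length fixed, the NumPy version can be used inside transform.
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "value": [10.0, 11.0, 15.0, 50.0, 47.0],
})
df["diffs"] = df.groupby("ticker")["value"].transform(
    lambda x: np.diff(x.to_numpy(), prepend=np.nan)
)

print(df)
```

    In practice x.diff() is the simpler choice, since it already returns a result aligned with the group's index.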
