How to create a lagged data structure using pandas dataframe

耶瑟儿~ 2020-12-04 15:24

Example

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
print(s)
1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a serie

8 answers
  • 2020-12-04 15:56

    I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.

    Here's an example of the result:

    # Setup
    indx = pd.Index([1, 2, 3, 4, 5], name='time')
    s=pd.Series(
        [5, 4, 3, 2, 1],
        index=indx,
        name='population')
    
    shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
    

    Result: a DataFrame with MultiIndex columns that have two levels: the original one ("population") and a new one ("lag").


    Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.

    def shift_timeseries_by_lags(df, lags, lag_label='lag'):
        return pd.concat([
            shift_timeseries_and_create_multiindex_column(df, lag,
                                                          lag_label=lag_label)
            for lag in lags], axis=1)
    
    def shift_timeseries_and_create_multiindex_column(
            dataframe, lag, lag_label='lag'):
        return (dataframe.shift(lag)
                         .pipe(append_level_to_columns_of_dataframe,
                               lag, lag_label))
    

    I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.

    def append_level_to_columns_of_dataframe(
            dataframe, new_level, name_of_new_level, inplace=False):
        """Given a (possibly MultiIndex) DataFrame, append labels to the column
        labels and assign this new level a name.
    
        Parameters
        ----------
        dataframe : a pandas DataFrame with an Index or MultiIndex columns
    
        new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
            The labels to put on the columns. If scalar, it is broadcast into a
            list of length equal to the number of columns in `dataframe`.
    
        name_of_new_level : str
            The label to give the new level.
    
        inplace : bool, optional, default: False
            Whether to modify `dataframe` in place or to return a copy
            that is modified.
    
        Returns
        -------
        dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
            The original `dataframe` with new columns that have the given
            `new_level` appended to each column label.
        """
        old_columns = dataframe.columns
    
        if not hasattr(new_level, '__len__') or isinstance(new_level, str):
            new_level = [new_level] * dataframe.shape[1]
    
        if isinstance(old_columns, pd.MultiIndex):
            # Use the per-column labels of each level; `.levels` only holds
            # the unique values and does not line up with the columns.
            new_columns = pd.MultiIndex.from_arrays(
                [old_columns.get_level_values(i)
                 for i in range(old_columns.nlevels)] + [new_level],
                names=(list(old_columns.names) + [name_of_new_level]))
        else:
            new_columns = pd.MultiIndex.from_arrays(
                [old_columns] + [new_level],
                names=([old_columns.name] + [name_of_new_level]))
    
        if inplace:
            dataframe.columns = new_columns
            return dataframe
        else:
            copy_dataframe = dataframe.copy()
            copy_dataframe.columns = new_columns
            return copy_dataframe
    

    Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:

    def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
        return pd.concat({
            '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
            df.shift(lag)
            for lag in lags},
            axis=1)
    

    Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):
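
To inspect that result yourself, here is a minimal, self-contained sketch. It re-declares the v2 function in a slightly shortened form (same behavior, assuming pandas is imported as pd) and prints the column labels and the frame. The dict keys passed to pd.concat become the outer column level, so each column is labeled by a (lag_N, population) pair.

```python
import pandas as pd


def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    # Dict keys become the outermost column level when concatenating on axis=1.
    return pd.concat(
        {'{}_{}'.format(lag_label, lag): df.shift(lag) for lag in lags},
        axis=1)


indx = pd.Index([1, 2, 3, 4, 5], name='time')
s = pd.Series([5, 4, 3, 2, 1], index=indx, name='population')

result = shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2])

# Column labels are tuples: (generated lag key, original column name).
print(result.columns.tolist())
print(result)
```

Unlike the first version, the new level here sits above the original column name rather than below it, and it is unnamed.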

  • 2020-12-04 15:56

    For many lags, this can be more compact:

    df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
    df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
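
If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: add_lags is a hypothetical helper name, not part of pandas, and the column naming follows the gdp_N pattern used above.

```python
import pandas as pd


def add_lags(df, column, lags):
    """Return a copy of `df` with one `<column>_<lag>` column per lag.

    Hypothetical helper for illustration; not a pandas function.
    """
    lagged = pd.DataFrame(
        {'{}_{}'.format(column, lag): df[column].shift(lag) for lag in lags},
        index=df.index)
    # join aligns on the index and returns a new frame; `df` is not modified.
    return df.join(lagged)


df = pd.DataFrame({
    'year': range(2000, 2010),
    'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})

out = add_lags(df, 'gdp', range(1, 4))
print(out.head())
```

The first rows of each lagged column are NaN, because there is no earlier observation to shift in.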
    
  • 2020-12-04 15:57

    You can do the following:

    s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
    res = pd.DataFrame(index=s.index)
    for lag in range(3):
        res[lag] = s.shift(lag)
    # Keep rows from label 3 onward; the first two rows contain NaN.
    print(res.loc[3:, :].to_numpy())
    

    It produces:

    [[3. 4. 5.]
     [2. 3. 4.]
     [1. 2. 3.]]
    

    which I hope is very close to what you actually want.

  • 2020-12-04 16:10

    Assuming you are focusing on a single column of your data frame, saved as s, this short snippet generates three copies of the column, shifted by lags 0, 1, and 2. Increase the range for more lags.

    s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
    shiftdf=pd.DataFrame()
    for i in range(3):
        shiftdf = pd.concat([shiftdf , s.shift(i).rename(s.name+'_'+str(i))], axis=1)
    
    shiftdf
    
       test_0  test_1  test_2
    1       5     NaN     NaN
    2       4     5.0     NaN
    3       3     4.0     5.0
    4       2     3.0     4.0
    5       1     2.0     3.0
    
  • 2020-12-04 16:14

    For a dataframe df with the lag to be applied on 'col name', you can use the shift function.

    df['lag1']=df['col name'].shift(1)
    df['lag2']=df['col name'].shift(2)
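
For more than a couple of lags, the same assignments can be written as a loop. A minimal sketch, assuming a toy frame whose 'col name' column mirrors the placeholder used above (the data itself is made up for illustration):

```python
import pandas as pd

# Hypothetical data; 'col name' is the placeholder column from the answer.
df = pd.DataFrame({'col name': [5, 4, 3, 2, 1]}, index=[1, 2, 3, 4, 5])

# Add one column per lag: lag1, lag2, ...
for lag in (1, 2):
    df['lag{}'.format(lag)] = df['col name'].shift(lag)

print(df)
```

Each shift leaves NaN in the first `lag` rows, so the lag columns become float even when the source column is integer.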
    
  • 2020-12-04 16:16

    Here is a cool one-liner for lagged features with _lagN suffixes in the column names, using pd.concat:

    lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
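
Two things to keep in mind with the one-liner above: s must have a name, and the suffix is lag+1 while the shift is lag, so the column ending in _lag1 holds the unshifted series. If you would rather have the suffix equal the actual shift, one possible variant (a sketch, using the small example series from this thread with an assumed name 'test') is:

```python
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5], name='test')

# Here the suffix N in `<name>_lagN` is exactly the number of periods shifted.
lagged = pd.concat(
    [s.shift(lag).rename('{}_lag{}'.format(s.name, lag)) for lag in range(1, 4)],
    axis=1,
).dropna()

print(lagged)
```

dropna() removes every row that contains at least one NaN, so with lags 1 to 3 only the rows that have three earlier observations remain.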
    