How to create a lagged data structure using pandas dataframe

耶瑟儿～ 2020-12-04 15:24

Example

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
print(s)
1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a serie

8 answers

不思量自难忘° (original poster)

2020-12-04 15:56

I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.

Here's an example of the result:

# Setup
import pandas as pd

indx = pd.Index([1, 2, 3, 4, 5], name='time')
s = pd.Series(
    [5, 4, 3, 2, 1],
    index=indx,
    name='population')

shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])

Result: a DataFrame whose columns are a MultiIndex with two levels: the original one ("population") and a new one ("lag").

Solution: As in the accepted solution, we use DataFrame.shift and then pandas.concat.

def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
                     .pipe(append_level_to_columns_of_dataframe,
                           lag, lag_label))
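
Stripped of the MultiIndex bookkeeping, the shift-then-concat idea can be sketched directly on the Series. This is a minimal illustration, not the function above; the column names 'lag_0', 'lag_1', ... are an assumption made for the example.

```python
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1],
              index=pd.Index([1, 2, 3, 4, 5], name='time'),
              name='population')

# Shift the series once per lag and place the shifted copies side by side.
# The labels passed via `keys` become the column names (illustrative choice).
lags = [0, 1, 2]
lagged = pd.concat([s.shift(lag) for lag in lags], axis=1,
                   keys=['lag_{}'.format(lag) for lag in lags])
print(lagged)
```

Each shift by k leaves k missing values at the start of that column, so the lagged columns are stored as floats with NaN in the first rows.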

I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.

def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns

    new_level : scalar, or arraylike of length equal to the number of columns
    in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.

    name_of_new_level : str
        The label to give the new level.

    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given
        `new_level` appended to each column label.
    """
    old_columns = dataframe.columns

    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]

    if isinstance(old_columns, pd.MultiIndex):
        # Use the per-column values of each existing level; `levels` holds
        # only the unique values and would not line up with the columns.
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns.get_level_values(i)
             for i in range(old_columns.nlevels)] + [new_level],
            names=(list(old_columns.names) + [name_of_new_level]))
    else:
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns, new_level],
            names=[old_columns.name, name_of_new_level])

    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
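
To make the level-appending step concrete, here is a small standalone sketch of the same idea using only MultiIndex.from_arrays. The frame, the 'area' column, and the 'tag' level are hypothetical names chosen for the example. For columns that are already a MultiIndex, the per-column labels come from get_level_values, since levels holds only the unique values of each level.

```python
import pandas as pd

# Hypothetical two-column frame with a named column index.
df = pd.DataFrame({'population': [5, 4, 3], 'area': [1, 2, 3]})
df.columns.name = 'variable'

# Plain Index -> two-level MultiIndex: pair every column with the same label.
two_level = df.copy()
two_level.columns = pd.MultiIndex.from_arrays(
    [df.columns, [1] * df.shape[1]],
    names=[df.columns.name, 'lag'])

# MultiIndex -> one more level: rebuild from the per-column values of each
# existing level, then add the new array of labels.
old = two_level.columns
three_level = two_level.copy()
three_level.columns = pd.MultiIndex.from_arrays(
    [old.get_level_values(i) for i in range(old.nlevels)] + [['a', 'b']],
    names=list(old.names) + ['tag'])

print(three_level.columns.tolist())
# [('population', 1, 'a'), ('area', 1, 'b')]
```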

Update: I learned from this solution another way to add a level to the columns, which makes append_level_to_columns_of_dataframe unnecessary:

def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
        df.shift(lag)
        for lag in lags},
        axis=1)

Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):
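
A self-contained sketch that reproduces this call, so the shape of the result can be checked. The function body is repeated here only so the block runs on its own. With a dict passed to pandas.concat, the dict keys become the outer column level.

```python
import pandas as pd

def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    # Dict keys ('lag_0', 'lag_1', ...) become the outer column level.
    return pd.concat(
        {'{}_{}'.format(lag_label, lag): df.shift(lag) for lag in lags},
        axis=1)

s = pd.Series([5, 4, 3, 2, 1],
              index=pd.Index([1, 2, 3, 4, 5], name='time'),
              name='population')

result = shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2])
print(result.columns.tolist())
# [('lag_0', 'population'), ('lag_1', 'population'), ('lag_2', 'population')]
print(result)
```

Unlike the first version, the lag label here sits in the outer column level and the original column name in the inner one.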
