What's the most efficient way to convert time-series data into cross-sectional data?

本小妞迷上赌 submitted on 2021-02-11 14:30:25


Here's the thing, I have the dataset below where date is the index:

date            value
2020-01-01      100
2020-02-01      140
2020-03-01      156
2020-04-01      161
2020-05-01      170

And I want to transform it into this other dataset:

value_t0    value_t1    value_t2    value_t3    value_t4 ...
100         NaN         NaN         NaN         NaN      ...
140         100         NaN         NaN         NaN      ...
156         140         100         NaN         NaN      ...
161         156         140         100         NaN      ...
170         161         156         140         100      ...

First I thought about using pandas.pivot_table, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and applying 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.


Try this:

new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})

The Series .shift(n) method gives you a single column of your desired output by shifting everything down n rows and filling in NaNs above. So we build a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension over the shift offsets 0 to len(df) - 1.
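To make that concrete, here is a minimal runnable sketch on the five rows from the question. The construction of df is my assumption about how the data is loaded (a 'value' column with a datetime index named 'date'); the one-liner itself is unchanged.

```python
import pandas as pd

# Sample data from the question; how df is built is an assumption.
df = pd.DataFrame(
    {"value": [100, 140, 156, 161, 170]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"]
    ),
)
df.index.name = "date"

# A single lag: every value moves down one row, NaN fills the top.
lag_1 = df["value"].shift(1)

# One column per lag, keyed by the column name we want.
new_df = pd.DataFrame(
    {f"value_t{i}": df["value"].shift(i) for i in range(len(df))}
)
print(new_df)
```

The result keeps the original date index, so each row still lines up with the observation it came from.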


I think the best approach is to use NumPy:

values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
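One caveat worth checking before relying on this: as far as I can tell, the snippet above fills every non-NaN cell of a row with that row's own value (the third row comes out as 156, 156, 156, NaN, NaN) rather than the lagged values 156, 140, 100 that the question asks for. A sketch of a vectorized variant that does reproduce the lagged layout, using plain index arithmetic, could look like the following. The variable names and the construction of df are mine, not from the answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question; how df is built is an assumption.
df = pd.DataFrame(
    {"value": [100, 140, 156, 161, 170]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"]
    ),
)
df.index.name = "date"

values = df["value"].to_numpy(dtype=float)
n = len(values)

# src[row, lag] is the position of the observation that belongs in that cell.
src = np.arange(n)[:, None] - np.arange(n)[None, :]

# Cells with a negative source position have no history yet, so they get NaN.
lagged = np.where(src >= 0, values[np.clip(src, 0, None)], np.nan)

new_df = pd.DataFrame(lagged, index=df.index).add_prefix("value_t")
print(new_df)
```

Like the original snippet, this allocates a full n x n array, so memory grows quadratically with the number of rows.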

Timings for 5000 rows:

values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing without add_prefix:

values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)

357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
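If you want to reproduce this kind of comparison on your own machine, a rough sketch with the standard timeit module is below. The helper names (shift_approach, numpy_approach, time_approaches), the random test data, and the default of 500 rows are all my assumptions, and the numbers you get will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def shift_approach(df):
    # The dictionary-comprehension version from the first answer.
    return pd.DataFrame(
        {f"value_t{i}": df["value"].shift(i) for i in range(len(df))}
    )


def numpy_approach(df):
    # The NumPy snippet from this answer, copied as written.
    values = np.asarray(df["value"].astype(float))
    new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
    new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
    return pd.DataFrame(new_values).add_prefix("value_t")


def time_approaches(n_rows=500, repeat=3):
    # Hypothetical helper: best wall-clock time in seconds per approach.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"value": rng.integers(0, 1000, size=n_rows)})
    results = {}
    for name, func in [("shift", shift_approach), ("numpy", numpy_approach)]:
        runs = timeit.repeat(lambda: func(df), number=1, repeat=repeat)
        results[name] = min(runs)
    return results


timings = time_approaches()
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Pass n_rows=5000 to match the size used above; both approaches build an n x n result, so expect the run time and memory use to grow quickly.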

