Repeat sections of dataframe based on a column value

Question

I'm collecting data over the course of many days and rather than filling it in for every day, I can elect to say that the data from one day should really be a repeat of another day. I'd like to repeat some of the rows from my existing data frame into the days specified as repeats. I have a column that indicates which day the current day is to repeat from but I am getting stuck with errors.

I have found ways to repeat rows n times based on a column value, but I am trying to use a column as an index to repeat data from previous rows.

I'd like to copy parts of my "Data" column for Day 1 into the "Data" column for Day 3, using my "Repeat" column (repeat_tag in the code below) as the index. I would like to do this for many more days.

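import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
data = [['1', 5, np.nan], ['1', 5, np.nan], ['1', 5, np.nan],
        ['2', 6, np.nan], ['2', 6, np.nan], ['2', 6, np.nan],
        ['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan]]

df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])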

Answer 1:

I slightly extended your test data:

data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns = ['Day', 'Data', 'repeat_tag'])

Details:

There are 4 days with observations.
Each observation has a different value (Data).
To avoid a "single day copy", values for day '3' are to be copied from day '1' and values for day '4' from day '2'.

I assume that a non-null value of repeat_tag appears in only one observation of each "target" day.
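That assumption can be checked before doing any processing. The following is only a minimal sketch: the names tags_per_day and offending are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])

# count() ignores NaN, so this is the number of tags set for each day
tags_per_day = df.groupby('Day')['repeat_tag'].count()

# Days carrying more than one tag would violate the assumption
offending = tags_per_day[tags_per_day > 1]
print(offending.empty)
```

If offending is not empty, the source day for that target day is ambiguous and the data should be cleaned first.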

I also added an obsNo column to identify observations within a particular day:

df['obsNo'] = df.groupby('Day').cumcount().add(1)

(it will be needed later).
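To see what this numbering produces, here is a small sketch on a made-up frame (toy is a hypothetical name used only for this illustration):

```python
import pandas as pd

# Hypothetical toy frame: two rows for day '1', three rows for day '2'
toy = pd.DataFrame({'Day': ['1', '1', '2', '2', '2']})

# cumcount() numbers the rows within each group starting from 0;
# add(1) shifts the numbering so that it starts from 1
toy['obsNo'] = toy.groupby('Day').cumcount().add(1)
print(toy)
```

The numbering restarts for every day, so the pair (Day, obsNo) identifies each row uniquely.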

The first step of the actual processing is to generate the replDays table, where the Day column is the target day and repeat_tag is the source day:

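# Boolean indexing plus .copy() avoids engine-dependent query() behaviour
# and a SettingWithCopyWarning on the assignment below
replDays = df.loc[df['repeat_tag'].notna(), ['Day', 'repeat_tag']].copy()
replDays['repeat_tag'] = replDays['repeat_tag'].astype(int).astype(str)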

The repeat_tag column needs a bit of type manipulation. Because it contains NaN values alongside integer values, pandas coerces it to float64. Hence, to get a string type (comparable with Day), it must be converted in two steps:

First to int, to drop the decimal part.
Then to str.
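The effect of the two-step conversion can be seen on a small stand-alone Series. This is only an illustrative sketch; the names s, direct and tags are made up here.

```python
import numpy as np
import pandas as pd

# Integers mixed with NaN end up as float64
s = pd.Series([np.nan, 1, 2])
print(s.dtype)

# Converting straight to str keeps the decimal part ('1.0', '2.0')
direct = s.dropna().astype(str).tolist()

# Going through int first drops it ('1', '2')
tags = s.dropna().astype(int).astype(str).tolist()
print(direct, tags)
```

Only the second form matches the Day values, which are strings such as '1' and '2'.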

The result is:

  Day repeat_tag
6   3          1
9   4          2

(fill data for day 3 with data from day 1 and data for day 4 with data from day 2).

The next step is to generate the replData table:

replData = (pd.merge(replDays, df, left_on='repeat_tag', right_on='Day',
                     suffixes=('_src', ''))
            [['Day_src', 'Day', 'Data', 'obsNo']]
            .set_index(['Day_src', 'obsNo'])
            .drop(columns='Day'))

The result is:

               Data
Day_src obsNo      
3       1      51.0
        2      52.0
        3      53.0
4       1      61.0
        2      62.0
        3      63.0

As you can see:

There is only one column of replacement data: Data (from days 1 and 2).
The MultiIndex contains both the day and the observation number (both will be needed for a proper update).

And the final part includes the following steps:

Copy df to res (result), setting the index to Day and obsNo (required for update).
Update this table with the data from replData.
Move Day and obsNo from the index back to "regular" columns.

The code is:

res = df.copy().set_index(['Day', 'obsNo'])
res.update(replData)
res.reset_index(inplace=True)

If you want, you can also drop the obsNo column.
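For convenience, the steps above can be gathered into one runnable sketch. The function name fill_repeated_days is hypothetical, and the row filter uses plain boolean indexing with .copy(), which is a stylistic assumption of this sketch; the logic otherwise follows the steps described above.

```python
import numpy as np
import pandas as pd

def fill_repeated_days(df):
    """Hypothetical helper: fill Data of tagged days from their source days."""
    df = df.copy()
    df['obsNo'] = df.groupby('Day').cumcount().add(1)

    # Target day and source day; the source is converted to str to match Day
    replDays = df.loc[df['repeat_tag'].notna(), ['Day', 'repeat_tag']].copy()
    replDays['repeat_tag'] = replDays['repeat_tag'].astype(int).astype(str)

    # Source rows, indexed by (target day, observation number)
    replData = (pd.merge(replDays, df, left_on='repeat_tag', right_on='Day',
                         suffixes=('_src', ''))
                [['Day_src', 'Data', 'obsNo']]
                .set_index(['Day_src', 'obsNo']))

    # update() aligns on the index values and overwrites matching cells
    res = df.set_index(['Day', 'obsNo'])
    res.update(replData)
    return res.reset_index().drop(columns='obsNo')

data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])

result = fill_repeated_days(df)
print(result)
```

With the extended test data, days 3 and 4 should end up with the Data values of days 1 and 2 respectively, while the input frame is left unchanged because the helper works on a copy.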

A remark concerning the solution by Peter: if the source data contains different values within any single day, his code fails with InvalidIndexError, probably because individual observations within a day are not identified. This supports my decision to add the obsNo column.

Answer 2:

Setup

import numpy as np
import pandas as pd

# Start with Valdi_Bo's expanded example data
data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns = ['Day', 'Data', 'repeat_tag'])

# Convert Day to integer data type
df['Day'] = df['Day'].astype(int)

# Spread repeat_tag values into all rows of tagged day
df['repeat_tag'] = df.groupby('Day')['repeat_tag'].ffill()
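The group-wise forward fill in the setup can be looked at in isolation. The sketch below uses a made-up frame (toy is a hypothetical name). Note that ffill only propagates values forward, so it relies on the tag sitting in the first row of its day, as it does in the example data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: day 2 is tagged in its first row to repeat day 1
toy = pd.DataFrame({'Day': [1, 1, 2, 2, 2],
                    'repeat_tag': [np.nan, np.nan, 1, np.nan, np.nan]})

# Forward fill within each day, so the tag never leaks into another day
toy['repeat_tag'] = toy.groupby('Day')['repeat_tag'].ffill()
print(toy)
```

Rows of untagged days stay NaN, which is what later keeps them out of the self-join.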

Solution

# Within each day, assign a number to each row
df['obs'] = df.groupby('Day').cumcount()

# Self-join
filler = (pd.merge(df, df, 
                   left_on=['repeat_tag', 'obs'], 
                   right_on=['Day', 'obs'])
            .set_index(['Day_x', 'obs'])['Data_y'])

# Fill missing data
df = df.set_index(['Day', 'obs'])
df.loc[df['Data'].isnull(), 'Data'] = filler
df = df.reset_index()

Result

df
    Day  obs  Data  repeat_tag
0     1    0  51.0         NaN
1     1    1  52.0         NaN
2     1    2  53.0         NaN
3     2    0  61.0         NaN
4     2    1  62.0         NaN
5     2    2  63.0         NaN
6     3    0  51.0         1.0
7     3    1  52.0         1.0
8     3    2  53.0         1.0
9     4    0  61.0         2.0
10    4    1  62.0         2.0
11    4    2  63.0         2.0

Source: https://stackoverflow.com/questions/56716238/repeat-sections-of-dataframe-based-on-a-column-value

Tags

python

pandas