Question
I have a dataframe:
import pandas as pd

df = pd.DataFrame.from_dict({
    'product': ('a', 'a', 'a', 'a', 'c', 'b', 'b', 'b'),
    'sales': ('-', '-', 'hot_price', 'hot_price', '-', 'min_price', 'min_price', 'min_price'),
    'price': (100, 100, 50, 50, 90, 70, 70, 70),
    'dt': ('2020-01-01 00:00:00', '2020-01-01 00:05:00', '2020-01-01 00:07:00', '2020-01-01 00:10:00', '2020-01-01 00:13:00', '2020-01-01 00:15:00', '2020-01-01 00:19:00', '2020-01-01 00:21:00')
})
  product      sales  price                   dt
0       a          -    100  2020-01-01 00:00:00
1       a          -    100  2020-01-01 00:05:00
2       a  hot_price     50  2020-01-01 00:07:00
3       a  hot_price     50  2020-01-01 00:10:00
4       c          -     90  2020-01-01 00:13:00
5       b  min_price     70  2020-01-01 00:15:00
6       b  min_price     70  2020-01-01 00:19:00
7       b  min_price     70  2020-01-01 00:21:00
I need the following output:
  product      sales  price                   dt  unique_group
0       a          -    100  2020-01-01 00:00:00             0
1       a          -    100  2020-01-01 00:05:00             0
2       a  hot_price     50  2020-01-01 00:07:00             1
3       a  hot_price     50  2020-01-01 00:10:00             1
4       c          -     90  2020-01-01 00:13:00             2
5       b  min_price     70  2020-01-01 00:15:00             3
6       b  min_price     70  2020-01-01 00:19:00             3
7       b  min_price     70  2020-01-01 00:21:00             3
How I currently do it:
unique_group = 0
df['unique_group'] = unique_group
for i in range(1, len(df)):
    current, prev = df.loc[i], df.loc[i - 1]
    if not all([
        current['product'] == prev['product'],
        current['sales'] == prev['sales'],
        current['price'] == prev['price'],
    ]):
        unique_group += 1
    df.loc[i, 'unique_group'] = unique_group
Is it possible to do this without iteration? I tried using cumsum(), shift(), ngroup(), and drop_duplicates(), but without success.
Answer 1:
If I understand correctly, you can use GroupBy.ngroup:
df['unique_group'] = df.groupby(['product', 'sales', 'price'], sort=False).ngroup()
print(df)
  product      sales  price                   dt  unique_group
0       a          -    100  2020-01-01 00:00:00             0
1       a          -    100  2020-01-01 00:05:00             0
2       a  hot_price     50  2020-01-01 00:07:00             1
3       a  hot_price     50  2020-01-01 00:10:00             1
4       c          -     90  2020-01-01 00:13:00             2
5       b  min_price     70  2020-01-01 00:15:00             3
6       b  min_price     70  2020-01-01 00:19:00             3
7       b  min_price     70  2020-01-01 00:21:00             3
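This works even if the dataframe is not ordered: with sort=False, each distinct combination is numbered in order of its first appearance. Note that a combination which reappears later gets its original number back rather than a new one.
Another approach, which assumes the dataframe is ordered so that rows of the same group are adjacent: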
cols = ['product', 'sales', 'price']
df['unique_group'] = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)
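The two approaches above agree on the question's data, but they can differ when a combination reappears after a gap. A minimal sketch on a made-up frame (the `demo` data and the column names `by_ngroup` and `by_run` are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical data: the ('a', '-', 100) combination reappears after a gap.
demo = pd.DataFrame({
    'product': ['a', 'b', 'a'],
    'sales': ['-', '-', '-'],
    'price': [100, 70, 100],
})
cols = ['product', 'sales', 'price']

# ngroup numbers each distinct combination by its first appearance,
# so the last row gets the same number as the first.
demo['by_ngroup'] = demo.groupby(cols, sort=False).ngroup()

# shift/cumsum starts a new group whenever a row differs from the previous row,
# so the last row gets a fresh number.
demo['by_run'] = demo[cols].ne(demo[cols].shift()).any(axis=1).cumsum().sub(1)

print(demo)
```

The loop in the question compares each row only with the previous one, so it behaves like `by_run`. Which variant is appropriate depends on whether a reappearing combination should count as the same group.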
Answer 2:
Another option, which might be a bit faster than groupby:
df['unique_group'] = (~df.duplicated(['product','sales','price'])).cumsum() - 1
Output:
  product      sales  price                   dt  unique_group
0       a          -    100  2020-01-01 00:00:00             0
1       a          -    100  2020-01-01 00:05:00             0
2       a  hot_price     50  2020-01-01 00:07:00             1
3       a  hot_price     50  2020-01-01 00:10:00             1
4       c          -     90  2020-01-01 00:13:00             2
5       b  min_price     70  2020-01-01 00:15:00             3
6       b  min_price     70  2020-01-01 00:19:00             3
7       b  min_price     70  2020-01-01 00:21:00             3
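One caveat that may be worth checking before relying on the duplicated() approach: it seems to depend on identical rows being adjacent, as they are in the question's data. A small sketch on hypothetical data (the `demo` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical data: the first combination reappears in the last row.
demo = pd.DataFrame({
    'product': ['a', 'b', 'a'],
    'sales': ['-', '-', '-'],
    'price': [100, 70, 100],
})

# A repeated row is flagged as a duplicate, so the counter does not advance
# and the row inherits the number of whichever group precedes it.
demo['unique_group'] = (~demo.duplicated(['product', 'sales', 'price'])).cumsum() - 1

print(demo['unique_group'].tolist())
```

Here the last row ends up in group 1 alongside product 'b', so this variant is probably best reserved for frames where each combination forms a single contiguous block.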
Source: https://stackoverflow.com/questions/60727830/auto-increment-inside-group