pandas-groupby

Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)

蹲街弑〆低调 submitted on 2019-11-28 10:01:36
I have the following data frame and want to:

- Group records by month
- Sum QTY_SOLD and NET_AMT for each unique UPC_ID (per month)
- Include the rest of the columns as well in the resulting dataframe

The way I thought I could do this is: first, create a month column to aggregate the D_DATE values, then sum QTY_SOLD by UPC_ID. Script:

    # Convert date column to a datetime object
    df['D_DATE'] = pd.to_datetime(df['D_DATE'])

    # Create aggregated months column
    df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m',))

    # Group by month and sum up quantity sold by UPC_ID
    df = df.groupby(['month', 'UPC_ID'])['QTY_SOLD'].sum()
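
To keep the non-aggregated columns, one option is a groupby transform, which returns a result aligned with the original rows. A minimal sketch with made-up sample values for the question's columns (the extra STORE column is hypothetical, standing in for "the rest of the columns"):

    import pandas as pd

    df = pd.DataFrame({
        'D_DATE': ['2019-01-05', '2019-01-20', '2019-02-11'],
        'UPC_ID': [111, 111, 222],
        'QTY_SOLD': [2, 3, 4],
        'NET_AMT': [10.0, 15.0, 20.0],
        'STORE': ['A', 'B', 'A'],
    })
    df['D_DATE'] = pd.to_datetime(df['D_DATE'])
    df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

    # transform('sum') broadcasts each group's sum back onto its rows,
    # so STORE and every other untouched column is preserved
    df[['QTY_SOLD', 'NET_AMT']] = (
        df.groupby(['month', 'UPC_ID'])[['QTY_SOLD', 'NET_AMT']].transform('sum')
    )
    print(df)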

Aggregate unique values from multiple columns with pandas GroupBy

心不动则不痛 submitted on 2019-11-28 09:36:07
Question: I have gone through countless threads (1, 2, 3, ...) and still can't find a solution to my problem. I have a dataframe like this:

    prop1  prop2  prop3   prop4
    L30    3      bob     11.2
    L30    54     bob     10
    L30    11     john    10
    L30    10     bob     10
    K20    12     travis  10
    K20    1      travis  4
    K20    66     leo     10

I would like to do a groupby on prop1 and, at the same time, get all the other columns aggregated, but only with unique values. Like that:

    prop1  prop2       prop3       prop4
    L30    3,54,11,10  bob,john    11.2,10
    K20    12,1,66     travis,leo  10,4

I tried with different approaches, without success.
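
A sketch of one possibility: aggregate every other column with a lambda that renders values as strings and de-duplicates them while preserving order (note that floats such as 10.0 will print with a trailing .0):

    import pandas as pd

    df = pd.DataFrame({
        'prop1': ['L30', 'L30', 'L30', 'L30', 'K20', 'K20', 'K20'],
        'prop2': [3, 54, 11, 10, 12, 1, 66],
        'prop3': ['bob', 'bob', 'john', 'bob', 'travis', 'travis', 'leo'],
        'prop4': [11.2, 10, 10, 10, 10, 4, 10],
    })

    # dict.fromkeys de-duplicates while keeping first-seen order
    agg = df.groupby('prop1', sort=False).agg(
        lambda s: ','.join(dict.fromkeys(s.astype(str)))
    )
    print(agg)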

Sample each group after pandas groupby

廉价感情. submitted on 2019-11-28 09:01:36
I know this must have been answered somewhere, but I just could not find it. Problem: sample each group after a groupby operation.

    import pandas as pd
    df = pd.DataFrame({'a': [1,2,3,4,5,6,7], 'b': [1,1,1,0,0,0,0]})
    grouped = df.groupby('b')
    # now sample from each group, e.g., I want 30% of each group

Apply a lambda and call sample with param frac:

    In [2]: df = pd.DataFrame({'a': [1,2,3,4,5,6,7], 'b': [1,1,1,0,0,0,0]})
       ...: grouped = df.groupby('b')
       ...: grouped.apply(lambda x: x.sample(frac=0.3))
    Out[2]:
         a  b
    b
    0 6  7  0
    1 2  3  1

Sample a fraction of each group: you can use GroupBy.apply with sample.
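
On pandas 1.1 and later there is also a direct GroupBy.sample, which avoids the apply entirely. A minimal sketch:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                       'b': [1, 1, 1, 0, 0, 0, 0]})

    # take 30% of the rows of each group; random_state makes it repeatable
    sampled = df.groupby('b').sample(frac=0.3, random_state=0)
    print(sampled)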

Count each group sequentially pandas

寵の児 submitted on 2019-11-28 08:40:01
Question: I have a df that I am grouping by two columns. I want to count each group sequentially. The code below counts each row within a group sequentially. This seems easier than I think it is, but I can't figure it out.

    df = pd.DataFrame({
        'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
        'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
        'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]
    })
    df[
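
A sketch of one way to number the groups themselves rather than the rows within them, using GroupBy.ngroup:

    import pandas as pd

    df = pd.DataFrame({
        'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
        'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
        'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506],
    })

    # ngroup() labels each group 0, 1, 2, ... in iteration order,
    # whereas cumcount() numbers the rows inside each group
    df['group_no'] = df.groupby(['Key', 'Date1']).ngroup() + 1
    print(df)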

pandas groupby: TOP 3 values for each group

南笙酒味 submitted on 2019-11-28 06:31:55
Question: A new and more generic question has been posted in "pandas groupby: TOP 3 values in each group and store in DataFrame", and a working solution has been answered there. In this example I create a dataframe df with some random data spaced 5 minutes apart. I want to create a dataframe gdf (grouped df) where the 3 highest values for each hour are listed. I.e., from this series of values:

                         VAL
    TIME
    2017-12-08 00:00:00   29
    2017-12-08 00:05:00   56
    2017-12-08 00:10:00   82
    2017-12-08 00:15:00   13
    2017-12-08 00:20
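
A minimal sketch of the hourly top 3 with pd.Grouper and SeriesGroupBy.nlargest (the values are random, so the exact output varies):

    import numpy as np
    import pandas as pd

    idx = pd.date_range('2017-12-08', periods=24, freq='5min', name='TIME')
    df = pd.DataFrame({'VAL': np.random.randint(0, 100, len(idx))}, index=idx)

    # bucket the index by hour, then keep the 3 largest VALs per bucket
    gdf = df.groupby(pd.Grouper(freq='h'))['VAL'].nlargest(3)
    print(gdf)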

Python (Pandas) Add subtotal on each lvl of multiindex dataframe

两盒软妹~` submitted on 2019-11-28 06:07:18
Assuming I have the following dataframe:

    a       b       c      Sce1  Sce2  Sce3  Sce4  Sce5  Sce6
    Animal  Ground  Dog     0.0   0.9   0.5   0.0   0.3   0.4
    Animal  Ground  Cat     0.6   0.5   0.3   0.5   1.0   0.2
    Animal  Air     Eagle   1.0   0.1   0.1   0.6   0.9   0.1
    Animal  Air     Owl     0.3   0.1   0.5   0.3   0.5   0.9
    Object  Metal   Car     0.3   0.3   0.8   0.6   0.5   0.6
    Object  Metal   Bike    0.5   0.1   0.4   0.7   0.4   0.2
    Object  Wood    Chair   0.9   0.6   0.1   0.9   0.2   0.8
    Object  Wood    Table   0.9   0.6   0.6   0.1   0.9   0.7

I want to create a MultiIndex which will contain the sum at each level. The output will look like this:

    a       b       c      Sce1  Sce2  Sce3  Sce4  Sce5  Sce6
    Animal                  1.9   1.6   1.4   1.3   2.7   1.6
            Ground          0.6   1.4   0.8   0.5
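
A sketch of one way to build those subtotals: compute groupby sums at each depth, pad their indexes out to three levels, and concatenate (only the first two scenario columns are shown, to keep it short):

    import pandas as pd

    df = pd.DataFrame({
        'a': ['Animal'] * 4 + ['Object'] * 4,
        'b': ['Ground', 'Ground', 'Air', 'Air', 'Metal', 'Metal', 'Wood', 'Wood'],
        'c': ['Dog', 'Cat', 'Eagle', 'Owl', 'Car', 'Bike', 'Chair', 'Table'],
        'Sce1': [0.0, 0.6, 1.0, 0.3, 0.3, 0.5, 0.9, 0.9],
        'Sce2': [0.9, 0.5, 0.1, 0.1, 0.3, 0.1, 0.6, 0.6],
    })

    leaf = df.set_index(['a', 'b', 'c'])

    # level-1 subtotals, padded with empty strings to a 3-level index
    lvl1 = df.groupby('a').sum(numeric_only=True)
    lvl1.index = pd.MultiIndex.from_arrays(
        [lvl1.index, [''] * len(lvl1), [''] * len(lvl1)])

    # level-2 subtotals, padded on the third level only
    lvl2 = df.groupby(['a', 'b']).sum(numeric_only=True)
    lvl2.index = pd.MultiIndex.from_arrays(
        [lvl2.index.get_level_values(0),
         lvl2.index.get_level_values(1),
         [''] * len(lvl2)])

    # '' sorts before any name, so each subtotal lands above its children
    out = pd.concat([lvl1, lvl2, leaf]).sort_index()
    print(out)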

concise way of flattening multiindex columns

邮差的信 submitted on 2019-11-28 04:02:01
Question: Using more than one function in a groupby-aggregate results in a MultiIndex, which I then want to flatten. Example:

    df = pd.DataFrame(
        {'A': [1,1,1,2,2,2,3,3,3],
         'B': np.random.random(9),
         'C': np.random.random(9)}
    )
    out = df.groupby('A').agg({'B': [np.mean, np.std], 'C': np.median})

Example output:

              B                   C
           mean       std    median
    A
    1  0.791846  0.091657  0.394167
    2  0.156290  0.202142  0.453871
    3  0.482282  0.382391  0.892514

Currently, I do it manually like this:

    out.columns = ['B_mean', 'B_std', 'C_median']
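
A sketch of a generic flattener that joins each (column, aggfunc) label pair with an underscore instead of listing the names by hand:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'B': np.random.random(9),
                       'C': np.random.random(9)})
    out = df.groupby('A').agg({'B': ['mean', 'std'], 'C': 'median'})

    # each column label is a (name, aggfunc) tuple; join the parts
    out.columns = ['_'.join(pair) for pair in out.columns]
    print(out.columns.tolist())   # ['B_mean', 'B_std', 'C_median']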

pandas group by and assign a group id then ungroup

左心房为你撑大大i submitted on 2019-11-28 01:38:50
I have a large data set in the following format:

    id, socialmedia
    1, facebook
    2, facebook
    3, google
    4, google
    5, google
    6, twitter
    7, google
    8, twitter
    9, snapchat
    10, twitter
    11, facebook

I want to group by socialmedia, assign a groupId column, and then ungroup (expand) back to individual records:

    id, socialmedia, groupId
    1, facebook, 1
    2, facebook, 1
    3, google, 2
    4, google, 2
    5, google, 2
    6, twitter, 3
    7, google, 2
    8, twitter, 3
    9, snapchat, 4
    10, twitter, 3
    11, facebook, 1

I tried the following but end up with "'DataFrameGroupBy' object does not support item assignment":

    x['grpId'] = x.groupby(
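
A sketch using GroupBy.ngroup, which assigns each row its group's ordinal directly, so no separate ungroup step is needed (sort=False numbers the groups in order of first appearance, matching the desired output):

    import pandas as pd

    df = pd.DataFrame({
        'id': range(1, 12),
        'socialmedia': ['facebook', 'facebook', 'google', 'google', 'google',
                        'twitter', 'google', 'twitter', 'snapchat', 'twitter',
                        'facebook'],
    })

    # ngroup() returns one label per row, already aligned with df
    df['groupId'] = df.groupby('socialmedia', sort=False).ngroup() + 1
    print(df)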

Pandas, groupby and count

对着背影说爱祢 submitted on 2019-11-28 01:27:04
I have a dataframe, say, like this:

    >>> df = pd.DataFrame({'user_id':['a','a','s','s','s'],
    ...                    'session':[4,5,4,5,5],
    ...                    'revenue':[-1,0,1,2,1]})
    >>> df
       revenue  session user_id
    0       -1        4       a
    1        0        5       a
    2        1        4       s
    3        2        5       s
    4        1        5       s

Each value of session and revenue represents a kind of type, and I want to count the number of each kind; say, the number of rows with revenue=-1 and session=4 for user_id=a is 1. I found that simply calling count() after groupby() can't output the result I want:

    >>> df.groupby('user_id').count()
             revenue  session
    user_id
    a              2        2
    s              3        3

How can I do that? You seem to want to group by several columns at once.
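
A minimal sketch: pass all three columns to groupby and use size() to count the rows in each combination:

    import pandas as pd

    df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                       'session': [4, 5, 4, 5, 5],
                       'revenue': [-1, 0, 1, 2, 1]})

    # one count per (user_id, session, revenue) combination
    counts = df.groupby(['user_id', 'session', 'revenue']).size()
    print(counts)
    # e.g. the (user_id='a', session=4, revenue=-1) combination occurs once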

pandas: drop duplicates in groupby 'date'

风流意气都作罢 submitted on 2019-11-28 00:43:51
Question: In the dataframe below, I would like to eliminate the duplicate cid values so the output from df.groupby('date').cid.size() matches the output from df.groupby('date').cid.nunique(). I have looked at this post, but it does not seem to have a solid solution to the problem.

    df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')

    df.groupby('date').cid.size()
    date
    2005       7
    2006     237
    2007    3610
    2008    1318
    2009    2664
    2010     997
    2011    6390
    2012    2904
    2013    7875
    2014
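
A sketch with toy data (the question loads its frame from the GitHub CSV instead): keep one row per (date, cid) pair with drop_duplicates, after which size() and nunique() agree:

    import pandas as pd

    df = pd.DataFrame({'date': [2005, 2005, 2006, 2006],
                       'cid':  ['x', 'x', 'y', 'z']})

    # one row per (date, cid) combination
    deduped = df.drop_duplicates(subset=['date', 'cid'])

    assert (deduped.groupby('date').cid.size()
            == deduped.groupby('date').cid.nunique()).all()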