pandas-groupby

Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

Submitted by 流过昼夜 on 2020-01-06 06:03:31
Question: So... I have a DataFrame that looks like this, but much larger:

```
        DATE ITEM STORE  STOCK
0 2018-06-06    A  L001      4
1 2018-06-06    A  L002      0
2 2018-06-06    A  L003      4
3 2018-06-06    B  L001      1
4 2018-06-06    B  L002      2
```

You can reproduce the same DataFrame with the following code:

```python
import pandas as pd
import numpy as np
import itertools as it

lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=
```
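Since the reproduction code above is cut off, here is a hedged, minimal sketch of one common way to get a custom sort order without a groupby split-apply-combine pass: encode the key column as an ordered categorical and let sort_values honour that order. The store ordering used below is purely hypothetical.

```python
import pandas as pd

# Minimal sketch: sort by a custom key order, no groupby involved.
df = pd.DataFrame({'STORE': ['L003', 'L001', 'L002', 'L001'],
                   'STOCK': [4, 4, 0, 1]})
store_order = ['L002', 'L003', 'L001']  # hypothetical custom order, not from the question
df['STORE'] = pd.Categorical(df['STORE'], categories=store_order, ordered=True)
print(df.sort_values('STORE'))
```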

Pandas groupby value_count filter by frequency

Submitted by 断了今生、忘了曾经 on 2020-01-04 05:49:14
Question: I would like to filter out the frequencies that are less than n; in my case n is 2.

```python
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no']})
df.groupby('A')['B'].value_counts()
```

```
A    B
bar  no     4
     yes    1
foo  yes    3
     no     2
Name: B, dtype: int64
```

Ideally I would like the results in a dataframe as shown below (the pair with a frequency of 1 is excluded):

```
   A    B  freq
 bar   no     4
 foo  yes     3
 foo   no     2
```

I have tried df
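The asker's attempt is truncated above; the following is a hedged sketch of one way to get the desired frame: filter the value_counts Series by the threshold, then reset the index into a freq column.

```python
import pandas as pd

# Sketch: keep only counts >= n and turn the result back into a DataFrame.
df = pd.DataFrame({'A': ['foo', 'bar'] * 5,
                   'B': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no']})
n = 2
counts = df.groupby('A')['B'].value_counts()
result = counts[counts >= n].rename('freq').reset_index()
print(result)
#      A    B  freq
# 0  bar   no     4
# 1  foo  yes     3
# 2  foo   no     2
```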

Pandas DataFrame - Aggregate on column whose dtype=='category' leads to slow performance

Submitted by 风格不统一 on 2020-01-03 13:09:37
Question: I work with big dataframes with high memory usage, and I read that if I change the dtype of columns with repeated values I can save a large amount of memory. I tried it and it indeed dropped the memory usage by 25%, but then I ran into a performance slowdown which I could not understand. I do group-by aggregation on the 'category' dtype columns; before I changed the dtype it took about 1 second, and after the change it took about 1 minute. This code demonstrates the performance degradation by
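The benchmark code is truncated above, so as a hedged suggestion only: a frequently cited cause of this kind of slowdown is that groupby on categorical keys builds the full category cross-product by default, while observed=True limits the result to combinations that actually occur. A sketch with made-up data:

```python
import pandas as pd
import numpy as np

# Illustrative sketch (not the asker's benchmark): with two categorical keys,
# the default groupby materialises the full 1000 x 1000 category cross-product,
# whereas observed=True keeps only the combinations present in the data.
n = 100_000
cats = [f'k{i}' for i in range(1000)]
df = pd.DataFrame({
    'key1': pd.Series(np.random.choice(cats, size=n), dtype='category'),
    'key2': pd.Series(np.random.choice(cats, size=n), dtype='category'),
    'val': np.random.rand(n),
})
slow = df.groupby(['key1', 'key2']).agg({'val': 'sum'})                  # 1,000,000 result rows
fast = df.groupby(['key1', 'key2'], observed=True).agg({'val': 'sum'})   # only observed pairs
```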

Custom sort order function for groupby pandas python

Submitted by 大憨熊 on 2020-01-03 05:22:29
Question: Let's say I have a grouped dataframe like the one below (obtained through an initial df.groupby(df["A"]).apply(some_func), where some_func returns a dataframe itself). The second column is the second level of the multiindex which was created by the groupby.

```
     B  C
A
1 0  1  8
  1  3  3
  2  0  1
2 1  2  2
3 0  1  3
  1  2  4
```

And I would like to order the groups on the result of a custom function that I apply to them. Let's assume for this example that the function is

```python
def my_func(group):
    return sum(group["B"]
```
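Since my_func is cut off above, the sketch below assumes it returns the sum of column B per group. It is one possible approach, not necessarily the accepted answer: compute the per-group score, sort it, and reindex the frame by the resulting group order.

```python
import pandas as pd

# Sketch: order groups by a custom per-group score (here: the sum of B), largest first.
df = pd.DataFrame({'A': [1, 1, 1, 2, 3, 3],
                   'B': [1, 3, 0, 2, 1, 2],
                   'C': [8, 3, 1, 2, 3, 4]})

def my_func(group):
    return group['B'].sum()

order = df.groupby('A').apply(my_func).sort_values(ascending=False).index
result = df.set_index('A').loc[order].reset_index()
print(result)
```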

Pandas Groupby TimeGrouper and apply

Submitted by 我的未来我决定 on 2020-01-03 00:52:27
Question: As per this question. This groupby works when applied to my df for a pd.rolling_mean column as follows:

```python
data['maFast'] = data['Last'].groupby(pd.TimeGrouper('d')).apply(pd.rolling_mean, center=False, window=10)
```

How do I apply the same groupby logic to another element of my df which combines pd.rolling_std and pd.rolling_mean:

```python
data['maSlow_std'] = pd.rolling_mean(data['Last'], window=60) + 2 * pd.rolling_std(data['Last'], 20, min_periods=20)
```

Answer 1: I think you need a lambda function: data['maSlow
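pd.TimeGrouper and the pd.rolling_* helpers have since been removed from pandas, so the hedged sketch below restates the idea from Answer 1 (a lambda applied per daily group) using the current pd.Grouper / .rolling API and synthetic data.

```python
import pandas as pd
import numpy as np

# Sketch: per daily group, combine a rolling mean and a rolling std via a lambda.
idx = pd.date_range('2018-01-01', periods=3000, freq='min')
data = pd.DataFrame({'Last': np.random.rand(3000)}, index=idx)

data['maSlow_std'] = (
    data['Last']
    .groupby(pd.Grouper(freq='D'))
    .apply(lambda s: s.rolling(60).mean() + 2 * s.rolling(20, min_periods=20).std())
    .reset_index(level=0, drop=True)   # drop the group level so it aligns with the original index
)
print(data.tail())
```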

Groupby two columns ignoring order of pairs

Submitted by ☆樱花仙子☆ on 2020-01-02 08:22:48
Question: Suppose we have a dataframe that looks like this:

```
  start stop  duration
0     A    B         1
1     B    A         2
2     C    D         2
3     D    C         0
```

What's the best way to construct a list of: i) start/stop pairs; ii) count of start/stop pairs; iii) average duration of start/stop pairs? In this case, order should not matter: (A,B) = (B,A). Desired output: [[start, stop, count, avg duration]]; in this example: [[A, B, 2, 1.5], [C, D, 2, 1]]

Answer 1: sort the first two columns (you can do this in-place, or create a copy and do the same thing; I've done the
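A hedged sketch of the sort-the-pair idea from Answer 1: normalise each (start, stop) pair so that order does not matter, then aggregate the count and mean duration per pair.

```python
import pandas as pd
import numpy as np

# Sketch: sort each row's (start, stop) values, then groupby the normalised pair.
df = pd.DataFrame({'start': ['A', 'B', 'C', 'D'],
                   'stop':  ['B', 'A', 'D', 'C'],
                   'duration': [1, 2, 2, 0]})

df[['start', 'stop']] = np.sort(df[['start', 'stop']].values, axis=1)
out = (df.groupby(['start', 'stop'])['duration']
         .agg(['count', 'mean'])
         .reset_index())
print(out.values.tolist())
# [['A', 'B', 2, 1.5], ['C', 'D', 2, 1.0]]
```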

Insert rows as a result of a groupby operation into the original dataframe

Submitted by 孤者浪人 on 2020-01-02 07:51:53
Question: For example, I have a pandas dataframe as follows:

```
col_1 col_2  col_3  col_4
    a     X      5      1
    a     Y      3      2
    a     Z      6      4
    b     X      7      8
    b     Y      4      3
    b     Z      6      5
```

I want to, for each value in col_1, add the values in col_3 and col_4 (and many more columns) that correspond to X and Z in col_2, and create a new row with these values. So the output would be as below:

```
col_1 col_2  col_3  col_4
    a     X      5      1
    a     Y      3      2
    a     Z      6      4
    a   NEW     11      5
    b     X      7      8
    b     Y      4      3
    b     Z      6      5
    b   NEW     13     13
```

Also, there could be more values in col_1 that will need the same
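One possible approach (a sketch, not necessarily the accepted answer): sum the X and Z rows per col_1 group, label the result NEW, then concatenate it back and re-sort so each new row follows its group.

```python
import pandas as pd

# Sketch: build the NEW rows from the X and Z rows of each group, then append them.
df = pd.DataFrame({'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'col_3': [5, 3, 6, 7, 4, 6],
                   'col_4': [1, 2, 4, 8, 3, 5]})

new_rows = (df[df['col_2'].isin(['X', 'Z'])]
            .groupby('col_1', as_index=False)[['col_3', 'col_4']]
            .sum()
            .assign(col_2='NEW'))

out = (pd.concat([df, new_rows], ignore_index=True)
         .sort_values('col_1', kind='stable')   # stable sort keeps NEW after its group
         .reset_index(drop=True))
print(out)
```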

Pandas enumerate groups in descending order

Submitted by 瘦欲@ on 2020-01-02 07:25:23
Question: I have the following column:

```
   column
0      10
1      10
2       8
3       8
4       6
5       6
```

My goal is to find the total number of unique values (3 in this case) and create a new column that would look like the following:

```
   new_column
0           3
1           3
2           2
3           2
4           1
5           1
```

The numbering starts from the number of unique values (3), and the same number is repeated if the current row is the same as the previous row, based on the original column. The number decreases as the row value changes. All unique values in the original column have the same number of rows (2 rows for each unique value in
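A hedged sketch of one way to produce new_column: number the groups in order of first appearance with ngroup, reversing the numbering so the first group gets the highest number.

```python
import pandas as pd

# Sketch: descending group numbers via ngroup(ascending=False).
df = pd.DataFrame({'column': [10, 10, 8, 8, 6, 6]})
df['new_column'] = df.groupby('column', sort=False).ngroup(ascending=False) + 1
print(df)
#    column  new_column
# 0      10           3
# 1      10           3
# 2       8           2
# 3       8           2
# 4       6           1
# 5       6           1
```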

Reshape pandas dataframe from rows to columns

Submitted by 亡梦爱人 on 2020-01-02 02:22:07
Question: I'm trying to reshape my data. At first glance, it sounds like a transpose, but it's not. I tried melts, stack/unstack, joins, etc.

Use case: I want to have only one row per unique individual, and put the entire job history in the columns. For clients, it can be easier to read information across rows rather than down columns.

Here's the data:

```python
import pandas as pd
import numpy as np

data1 = {'Name': ["Joe", "Joe", "Joe", "Jane", "Jane"],
         'Job': ["Analyst", "Manager", "Director", "Analyst",
```
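The sample data is truncated above, so the sketch below rebuilds a similar frame (Jane's second job is invented for illustration) and shows one common reshape: number each person's jobs with cumcount, then pivot so the history runs across the columns.

```python
import pandas as pd

# Sketch: one row per Name, job history spread across Job1, Job2, ... columns.
df = pd.DataFrame({'Name': ['Joe', 'Joe', 'Joe', 'Jane', 'Jane'],
                   'Job': ['Analyst', 'Manager', 'Director', 'Analyst', 'Manager']})

df['Job#'] = df.groupby('Name').cumcount() + 1           # running job number per person
wide = df.pivot(index='Name', columns='Job#', values='Job').add_prefix('Job')
print(wide.reset_index())
#    Name     Job1     Job2      Job3
# 0  Jane  Analyst  Manager       NaN
# 1   Joe  Analyst  Manager  Director
```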