pandas-groupby

df.groupby(…).agg(set) produces a different result than df.groupby(…).agg(lambda x: set(x))

纵然是瞬间 submitted on 2019-12-03 05:08:30
Answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

Data:

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga', 'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

Demo:

In [36]: df.groupby('user_id').agg(lambda x: set(x))
Out[36]:
                    class_type    instructor
user_id
1        {Krav Maga, Ju-jitsu}  {Alice, Bob}
2            {Yoga, Krav Maga}       {Alice}
3           {Ju-jitsu, Karate}         {Bob}
4                  {Krav Maga}       {Alice}

In [37]: df
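The excerpt is cut off before the agg(set) output, so the actual difference is not reproduced here. Whatever the dispatch details inside .agg(), wrapping the aggregation in an explicit lambda (or a named function) is the unambiguous form: it is called once per group with that group's values. A minimal sketch, reusing the data above:

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga', 'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob'],
})

# An explicit lambda is always invoked once per group with that group's
# values, so there is no ambiguity about how pandas dispatches it --
# unlike a bare builtin such as set, which .agg() may special-case.
result = df.groupby('user_id').agg(lambda x: set(x))
print(result)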

pandas: GroupBy .pipe() vs .apply()

百般思念 submitted on 2019-12-02 20:52:55
In the example from the pandas documentation about the new .pipe() method for GroupBy objects, an .apply() call accepting the same lambda would return the same results.

In [195]: import numpy as np

In [196]: n = 1000

In [197]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
   .....:                    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
   .....:                    'Revenue': (np.random.random(n)*50+10).round(2),
   .....:                    'Quantity': np.random.randint(1, 10, size=n)})

In [199]: (df.groupby(['Store', 'Product'])
   .....:    .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
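The key difference is in what the callable receives. .pipe(f) hands the GroupBy object itself to f exactly once, so inside the lambda grp.Revenue.sum() is a per-group Series and the division is vectorized across groups. .apply(f) calls f once per group with that group's sub-DataFrame, so grp.Revenue.sum() is a scalar there, and pandas reassembles the results. For this particular computation both yield the same Series; a self-contained sketch:

import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame({
    'Store': np.random.choice(['Store_1', 'Store_2'], n),
    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
    'Revenue': (np.random.random(n) * 50 + 10).round(2),
    'Quantity': np.random.randint(1, 10, size=n),
})

grouped = df.groupby(['Store', 'Product'])

# .pipe: the lambda runs once; grp.Revenue.sum() is a Series indexed by
# (Store, Product), so the division happens for all groups at once.
piped = grouped.pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())

# .apply: the lambda runs once per group on that group's sub-DataFrame;
# each call returns a scalar, stitched back into a Series with the same index.
applied = grouped.apply(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())

print(np.allclose(piped, applied))  # True for this computation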

Groupby filter based on a consecutive sequence, sorted by ID and Date columns

天涯浪子 submitted on 2019-12-02 18:41:41
Question: I have a dataframe as shown below:

    ID Status        Date
0    1      F  2017-06-22
1    1      M  2017-07-22
2    1      P  2017-10-22
3    1      F  2018-06-22
4    1      P  2018-08-22
5    1      F  2018-10-22
6    1      F  2019-03-22
7    2      M  2017-06-29
8    2      F  2017-09-29
9    2      F  2018-01-29
10   2      M  2018-03-29
11   2      P  2018-08-29
12   2      M  2018-10-29
13   2      F  2018-12-29
14   3      M  2017-03-20
15   3      F  2018-06-20
16   3      P  2018-08-20
17   3      M  2018-10-20
18   3      F  2018-11-20
19   3      P  2018-12-20
20   3      F  2019-03-20
22   4      M  2017-08-10
23   4      F  2018-06-10
24   4      P  2018-08-10
25   4      F  2018-12-10
26   4      M
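The question is truncated before the actual filtering rule is stated. Problems phrased like this, though, usually reduce to labeling consecutive runs of a value per ID with a shift/cumsum key and then filtering on the runs. A sketch of that building block (the run-length filter at the end is an illustrative assumption, not the asker's stated condition):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Status': ['F', 'F', 'M', 'M', 'M', 'P'],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2017-10-22',
                            '2017-06-29', '2017-09-29', '2018-01-29']),
}).sort_values(['ID', 'Date'])

# A run starts wherever Status differs from the previous row within the
# same ID; cumulative-summing those break points labels each run.
run_id = (df['Status'] != df.groupby('ID')['Status'].shift()).cumsum()

# Example filter: keep only runs of length >= 2.
out = df[df.groupby(run_id)['Status'].transform('size') >= 2]
print(out)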

filter dataframe after groupby and nunique in pandas

亡梦爱人 submitted on 2019-12-02 18:07:38
Question: I tried df.groupby("item")["variable"].nunique() and it returns a unique count for every item. I want to filter it to only return items where the count of "variable" is greater than 3, conditional on the groupby item. Is there a method?

Answer 1: When you want the groupby result mapped back onto every row of the input, think transform:

df = df[df.groupby("item")["variable"].transform('nunique') > 3]

Source: https://stackoverflow.com/questions/53551777/filter-dataframe-after-groupby-and-nunique-in-pandas
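A self-contained toy illustration of why transform is the right tool here (hypothetical data):

import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'a', 'a', 'a', 'b', 'b'],
    'variable': [1, 2, 3, 4, 1, 1],
})

# .nunique() collapses the result to one row per item ...
print(df.groupby('item')['variable'].nunique())   # a: 4, b: 1

# ... while .transform('nunique') broadcasts each item's count back onto
# every input row, so it can serve directly as a boolean row filter.
filtered = df[df.groupby('item')['variable'].transform('nunique') > 3]
print(filtered)  # only the 'a' rows survive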

Looping through an Excel spreadsheet (using openpyxl)

空扰寡人 submitted on 2019-12-02 13:35:55
import openpyxl
wb = openpyxl.load_workbook('Book_1.xlsx')
ws = wb['Sheet_1']

I am trying to analyze an Excel spreadsheet using openpyxl. My goal is to get the max number from column D for each group of numbers in column A. I would like help getting the code to loop for the analysis. Here is an example of the spreadsheet that I am trying to analyze. The file name is Book 1 and the sheet name is Sheet 1. I am running Python 3.6.1, pandas 0.20.1, and openpyxl 2.4.7. I am providing the code I have so far.

IIUC, use the pandas module to achieve this:

import pandas as pd
df = pd.read_excel('yourfile.xlsx'
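The answer breaks off inside the read_excel call. A sketch of where it is presumably headed, assuming the worksheet's headers are literally named 'A' and 'D' (the real column names in Book_1.xlsx are not shown in the excerpt):

import pandas as pd

# Read the sheet; on the asker's pandas 0.20.1 the keyword was 'sheetname',
# renamed to 'sheet_name' in later versions.
df = pd.read_excel('Book_1.xlsx', sheet_name='Sheet_1')

# Max of column D within each group of identical values in column A.
# 'A' and 'D' are placeholder headers -- substitute the real ones.
result = df.groupby('A')['D'].max()
print(result)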

Iterating over groups in a dataframe [duplicate]

試著忘記壹切 submitted on 2019-12-02 11:54:18
This question already has an answer here: Looping over groups in a grouped dataframe.

The issue I am having is that I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in the date groups to pass to a function. The issue is that groupby seems to create a tuple of the key and then a massive string consisting of all of the rows in the data, making iterating through each row impossible.

When you apply groupby on a dataframe, you don't get rows, you get groups of
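The answer is truncated, but the pattern it is leading to is standard: iterating a GroupBy yields (key, sub-DataFrame) pairs, and each sub-DataFrame can then be iterated row-wise or passed to a function whole. A sketch (the 'Date'/'value' columns are illustrative):

import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-01-01', '2019-01-01', '2019-01-02'],
    'value': [10, 20, 30],
})

# Iterating a GroupBy yields (group_key, sub-DataFrame) pairs, not strings.
for date, group in df.groupby('Date'):
    # group is a regular DataFrame holding only this date's rows, so any
    # row-wise or whole-frame function can be applied to it.
    for _, row in group.iterrows():
        print(date, row['value'])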

How to reset cumsum after change in sign of values?

孤街醉人 submitted on 2019-12-02 11:22:38
In [46]: d = np.random.randn(10, 1) * 2
In [47]: df = pd.DataFrame(d.astype(int), columns=['data'])

I am trying to create a cumsum column that resets after a sign change in the data column, like this:

   data  custom_cumsum
0    -2             -2
1    -1             -3
2     1              1
3    -3             -3
4    -1             -4
5     2              2
6     0              2
7     3              5
8    -1             -1
9    -2             -3

I am able to achieve this with df.iterrows(). I am trying to avoid iterrows and do it with vector operations. There are a couple of questions on resetting cumsum when there is a NaN, but I am not able to achieve this cumsum with those solutions.

Create a new key to groupby, then do cumsum within each group. New
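The answer breaks off right after stating the idea. A sketch of one way to build such a key, assuming zeros should continue the previous run (as row 6 of the desired output implies; a leading zero would need extra handling):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [-2, -1, 1, -3, -1, 2, 0, 3, -1, -2]})

# Sign of each value; zeros inherit the previous sign so they extend the
# current run instead of starting a new group (matches row 6 above).
sign = np.sign(df['data']).replace(0, np.nan).ffill()

# A new group starts whenever the sign flips; cumsum of the break points
# labels each run with its own key.
key = sign.ne(sign.shift()).cumsum()

# Cumulative sum restarts inside every run, reproducing custom_cumsum.
df['custom_cumsum'] = df.groupby(key)['data'].cumsum()
print(df)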