pandas-groupby

Python Pandas max value in a group as a new column

Submitted by 半世苍凉 on 2019-11-27 07:44:19

Question: I am trying to calculate a new column that contains the maximum value for each of several groups. I'm coming from a Stata background, so I know the Stata code would be something like:

    by group, sort: egen max = max(odds)

For example, with:

    data = {'group': ['A', 'A', 'B', 'B'], 'odds': [85, 75, 60, 65]}

I would like the result to look like:

    group  odds  max
    A        85   85
    A        75   85
    B        60   65
    B        65   65

Eventually I am trying to form a column that takes 1/(max - min) * odds, where max and min are computed per group.

Answer 1:
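The answer text is cut off above; a minimal sketch of the standard approach, using groupby().transform('max') to broadcast each group's maximum back onto every row (the sample data is taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'],
                   'odds': [85, 75, 60, 65]})

# transform returns one value per original row, so the per-group max
# lines up with the untouched columns
df['max'] = df.groupby('group')['odds'].transform('max')
df['min'] = df.groupby('group')['odds'].transform('min')

# the derived column the question asks for
df['scaled'] = 1 / (df['max'] - df['min']) * df['odds']
print(df)
```

The same pattern works for any reduction ('min', 'mean', 'sum', ...) that should be attached to every row of its group.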

Looping over groups in a grouped dataframe

Submitted by 不打扰是莪最后的温柔 on 2019-11-27 07:01:23

Question: Consider this small example:

    data = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
    frame = pd.DataFrame(data, columns=["X", "Y", "Z"], index=["A", "A", "A", "B", "B"])

I want to group frame with:

    grouped = frame.groupby(frame.index)

Then I want to loop over the groups:

    for group in grouped:

But I'm stuck on the next step: how can I extract each group as a pandas DataFrame so I can process it further?

Answer 1: Iterating over a GroupBy object yields 2-tuples of (group key, sub-DataFrame), not bare DataFrames.
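A minimal sketch of the unpacking, using the question's own data:

```python
import pandas as pd

data = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
frame = pd.DataFrame(data, index=["A", "A", "A", "B", "B"])

# unpack each (key, sub-DataFrame) pair instead of a bare `group`
for name, group in frame.groupby(frame.index):
    # group is a regular DataFrame restricted to one index value
    print(name, group.shape)
```

Inside the loop, `group` supports everything an ordinary DataFrame does.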

Combine duplicated columns within a DataFrame

Submitted by 自闭症网瘾萝莉.ら on 2019-11-27 05:45:14

Question: If a dataframe has columns that share the same name, is there a way to combine those columns with some function (e.g. sum)? For instance:

    In [186]: df["NY-WEB01"].head()
    Out[186]:
                         NY-WEB01  NY-WEB01
    DateTime
    2012-10-18 16:00:00       5.6       2.8
    2012-10-18 17:00:00      18.6      12.0
    2012-10-18 18:00:00      18.4      12.0
    2012-10-18 19:00:00      18.2      12.0
    2012-10-18 20:00:00      19.2      12.0

How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just
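The question (and any answer) is cut off above; one common approach, sketched here rather than taken from the thread, is to group the transposed frame by column name, sum, and transpose back. The transpose form is used because `groupby(..., axis=1)` is deprecated in recent pandas.

```python
import pandas as pd

# duplicate column labels, as in the question
df = pd.DataFrame([[5.6, 2.8], [18.6, 12.0], [18.4, 12.0]],
                  columns=['NY-WEB01', 'NY-WEB01'])

# transpose so the duplicate labels become the index, group rows that
# share a label, sum them, then transpose back
combined = df.T.groupby(level=0).sum().T
print(combined)
```

This collapses every set of same-named columns in one pass, not just NY-WEB01.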

What is the equivalent of SQL “GROUP BY HAVING” on Pandas?

Submitted by ε祈祈猫儿з on 2019-11-27 05:42:47

Question: What would be the most efficient way to use groupby and, in parallel, apply a filter in pandas? Basically I am asking for the pandas equivalent of this SQL:

    select * ... group by col_name having condition

I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful. I need very good performance, so ideally such a command would not be the result of several layered operations done in Python.

Answer 1: As mentioned in
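The built-in counterpart of HAVING is GroupBy.filter, which keeps or drops whole groups based on a per-group condition. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col_name': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 2, 3, 4, 5]})

# SQL: SELECT * FROM df GROUP BY col_name HAVING SUM(value) > 5
# group 'a' sums to 3 (dropped), group 'b' sums to 12 (kept)
kept = df.groupby('col_name').filter(lambda g: g['value'].sum() > 5)
print(kept)
```

Unlike SQL's HAVING, filter returns the original rows of the surviving groups rather than one aggregated row per group; follow it with an agg step if you also need the aggregate.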

group by pandas dataframe and select latest in each group

Submitted by 不打扰是莪最后的温柔 on 2019-11-27 04:12:18

Question: How can I group the values of a pandas dataframe and select the latest (by date) row from each group? For example, given a dataframe sorted by date:

       id  product        date
    0  220     6647  2014-09-01
    1  220     6647  2014-09-03
    2  220     6647  2014-10-16
    3  826     3380  2014-11-11
    4  826     3380  2014-12-09
    5  826     3380  2015-05-19
    6  901     4555  2014-09-01
    7  901     4555  2014-10-05
    8  901     4555  2014-11-01

grouping by id or product and selecting the latest should give:

       id  product        date
    2  220     6647  2014-10-16
    5  826     3380  2015-05-19
    8  901     4555  2014-11-01

Answer 1: Use idxmax inside the groupby and slice df with loc:

    df.loc[df.groupby('id').date.idxmax()]

       id  product        date
    2  220
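A self-contained version of that approach, with the date column parsed so idxmax compares real timestamps rather than strings:

```python
import pandas as pd

df = pd.DataFrame({
    'id':      [220, 220, 220, 826, 826, 826],
    'product': [6647, 6647, 6647, 3380, 3380, 3380],
    'date': pd.to_datetime(['2014-09-01', '2014-09-03', '2014-10-16',
                            '2014-11-11', '2014-12-09', '2015-05-19']),
})

# idxmax returns the row label of the latest date within each id;
# loc then pulls those full rows back out of the frame
latest = df.loc[df.groupby('id')['date'].idxmax()]
print(latest)
```

For the earliest row per group, swap idxmax for idxmin.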

How to summarize on different groupby combinations?

Submitted by 自作多情 on 2019-11-27 04:07:42

Question: I am compiling a table of the top-3 crops by county. Some counties have the same crop varieties in the same order; other counties have the same varieties in a different order.

    df1 = pd.DataFrame({
        "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
        "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
        "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
        "Crop3": ["apples", "apples",
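The question is truncated above, but the usual trick for grouping on an order-insensitive combination is to build a frozenset key from the crop columns. A sketch using a subset of the question's rows (the Crop3 values for Harney and Baker are visible in the truncated snippet; the Hood River value is a hypothetical completion):

```python
import pandas as pd

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Hood River"],
    "Crop1":  ["grain", "melons", "apples"],
    "Crop2":  ["melons", "grain", "melons"],
    "Crop3":  ["apples", "apples", "pears"],
})

# frozenset ignores ordering, so {grain, melons, apples} matches
# regardless of which column each crop sits in
key = df1[["Crop1", "Crop2", "Crop3"]].apply(frozenset, axis=1)
combos = df1.groupby(key)["County"].agg(list)
print(combos)
```

Harney and Baker share the same three crops in different column order, so they land in one group; Hood River forms its own.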

What is the difference between pandas agg and apply function?

Submitted by 我怕爱的太早我们不能终老 on 2019-11-27 04:04:31

Question: I can't figure out the difference between pandas' .aggregate and .apply functions. Take the following as an example: I load a dataset, do a groupby, define a simple function, and then use either .agg or .apply. As you can see, the print statements inside my function produce the same output whether I use .agg or .apply, yet the results differ. Why is that?

    import pandas as pd

    iris = pd.read_csv('iris.csv')
    by_species = iris.groupby('Species')

    def f(x):
        print(type(x))
        print(x.head(3))
        return 1

Using apply:

    by_species.apply(f)
    # <class 'pandas.core
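The short answer: apply calls f once per group, passing the whole sub-DataFrame; agg calls f once per column of each group. A minimal sketch with inline data standing in for iris.csv:

```python
import pandas as pd

df = pd.DataFrame({'Species': ['setosa', 'setosa', 'virginica'],
                   'SepalLength': [5.1, 4.9, 6.3],
                   'SepalWidth': [3.5, 3.0, 3.3]})
by_species = df.groupby('Species')

def f(x):
    return 1

# apply: one call per group -> a Series with one value per species
applied = by_species.apply(f)

# agg: one call per column of each group -> a DataFrame of 1s,
# one column per aggregated column
agged = by_species.agg(f)
print(applied)
print(agged)
```

That is why the shapes differ: apply produces a single result per group, while agg produces a result per (group, column) pair.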

Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)

Submitted by 大城市里の小女人 on 2019-11-27 02:41:44

Question: I have the following data frame and want to:

- group records by month
- sum QTY_SOLD and NET_AMT of each unique UPC_ID (per month)
- include the rest of the columns in the resulting dataframe

The way I thought I could do this: first create a month column to aggregate the D_DATEs, then sum QTY_SOLD by UPC_ID. Script:

    # Convert date to datetime object
    df['D_DATE'] = pd.to_datetime(df['D_DATE'])

    # Create aggregated months column
    df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m
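The script is cut off above. To keep every original column while attaching per-group sums, transform is the usual tool, since it returns a result aligned to the original rows. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'D_DATE': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-10']),
    'UPC_ID': [100, 100, 100],
    'QTY_SOLD': [2, 3, 4],
    'NET_AMT': [10.0, 15.0, 20.0],
})

# derive the month key with the .dt accessor (no per-row apply needed)
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

# transform keeps the original shape, so every other column survives
grp = df.groupby(['month', 'UPC_ID'])
df['QTY_SOLD_SUM'] = grp['QTY_SOLD'].transform('sum')
df['NET_AMT_SUM'] = grp['NET_AMT'].transform('sum')
print(df)
```

If one row per (month, UPC_ID) is wanted instead, follow this with drop_duplicates on those keys.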

Select the max row per group - pandas performance issue

Submitted by 我只是一个虾纸丫 on 2019-11-26 23:10:50

Question: I'm selecting one max row per group, using groupby/apply to return index values and then selecting the rows with loc. For example, to group by "Id" and select the row with the highest "delta" value:

    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    selected_rows = df.loc[selected_idx, :]

However, this is very slow: my i7/16 GB RAM laptop hangs when I run this query on 13 million rows. I have two questions for the experts: how can I make this query run fast
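The per-group Python lambda is the likely bottleneck. Two vectorised alternatives that usually scale much better are a plain idxmax, or sort_values + drop_duplicates; a sketch with fixed sample data so the result is deterministic:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2],
                   'delta': [0.5, 0.9, 0.3, 0.7]})

# vectorised idxmax: one pass, no Python-level call per group
fast_a = df.loc[df.groupby('Id')['delta'].idxmax()]

# sort once, then keep the last (largest) row of each Id
fast_b = df.sort_values('delta').drop_duplicates('Id', keep='last')
print(fast_a)
```

Both return the full row of the per-group maximum; they can differ only in row order and in which row wins when deltas tie.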

Splitting a dataframe into separate CSV files

Submitted by 落花浮王杯 on 2019-11-26 22:26:56

Question: I have a fairly large csv, looking like this:

    +---------+---------+
    | Column1 | Column2 |
    +---------+---------+
    |       1 |   93644 |
    |       2 |   63246 |
    |       3 |   47790 |
    |       3 |   39644 |
    |       3 |   32585 |
    |       1 |   19593 |
    |       1 |   12707 |
    |       2 |   53480 |
    +---------+---------+

My intent is to:

- add a new column
- insert a specific value into that column, 'NewColumnValue', on each row of the csv
- sort the file based on the value in Column1
- split the original CSV into new files based on the contents of 'Column1', removing the
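The last step is cut off above; the steps that are visible can be sketched with groupby + to_csv. The column name, file names, and output directory below are my own choices, not from the question:

```python
import pandas as pd
from pathlib import Path

df = pd.DataFrame({'Column1': [1, 2, 3, 3, 3, 1, 1, 2],
                   'Column2': [93644, 63246, 47790, 39644, 32585, 19593, 12707, 53480]})

# add the constant column and sort by Column1
df['NewColumn'] = 'NewColumnValue'
df = df.sort_values('Column1')

out = Path('split_out')
out.mkdir(exist_ok=True)

# groupby yields one sub-frame per distinct Column1 value;
# write each to its own CSV
for key, chunk in df.groupby('Column1'):
    chunk.to_csv(out / f'part_{key}.csv', index=False)
```

For a CSV too large to hold in memory, the same loop can run per chunk via pd.read_csv(..., chunksize=...), appending to the per-key files with mode='a'.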