pandas-groupby

Python Pandas max value in a group as a new column

Submitted by 半世苍凉 on 2019-11-27 07:44:19

Question: I am trying to calculate a new column that contains the maximum value for each of several groups. I'm coming from a Stata background, so I know the Stata code would be something like:

    by group, sort: egen max = max(odds)

For example, with:

    data = {'group': ['A', 'A', 'B', 'B'], 'odds': [85, 75, 60, 65]}

I would like the result to look like:

    group  odds  max
    A        85   85
    A        75   85
    B        60   65
    B        65   65

Eventually I am trying to form a column that takes 1/(max - min) * odds, where max and min are computed per group.

Answer 1:
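The answer text is cut off above; a minimal sketch of the standard approach, using groupby().transform('max') to broadcast each group's maximum back onto every row (the sample data is taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'],
                   'odds': [85, 75, 60, 65]})

# transform returns one value per original row, so the per-group max
# lines up with the untouched columns
df['max'] = df.groupby('group')['odds'].transform('max')
df['min'] = df.groupby('group')['odds'].transform('min')

# the derived column the question asks for
df['scaled'] = 1 / (df['max'] - df['min']) * df['odds']
print(df)
```

The same pattern works for any reduction ('min', 'mean', 'sum', ...) that should be attached to every row of its group.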

Looping over groups in a grouped dataframe

Submitted by 不打扰是莪最后的温柔 on 2019-11-27 07:01:23

Question: Consider this small example:

    data = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
    frame = pd.DataFrame(data, columns=["X", "Y", "Z"], index=["A", "A", "A", "B", "B"])

I want to group frame with:

    grouped = frame.groupby(frame.index)

Then I want to loop over the groups:

    for group in grouped:

But I'm stuck on the next step: how can I extract each group as a pandas DataFrame so I can process it further?

Answer 1: Iterating over a GroupBy object yields 2-tuples of (group key, sub-DataFrame), not bare DataFrames.
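A minimal sketch of the unpacking, using the question's own data:

```python
import pandas as pd

data = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
frame = pd.DataFrame(data, index=["A", "A", "A", "B", "B"])

# unpack each (key, sub-DataFrame) pair instead of a bare `group`
for name, group in frame.groupby(frame.index):
    # group is a regular DataFrame restricted to one index value
    print(name, group.shape)
```

Inside the loop, `group` supports everything an ordinary DataFrame does.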

Combine duplicated columns within a DataFrame

Submitted by 自闭症网瘾萝莉.ら on 2019-11-27 05:45:14

Question: If a dataframe has columns that share the same name, is there a way to combine those columns with some function (e.g. sum)? For instance:

    In [186]: df["NY-WEB01"].head()
    Out[186]:
                         NY-WEB01  NY-WEB01
    DateTime
    2012-10-18 16:00:00       5.6       2.8
    2012-10-18 17:00:00      18.6      12.0
    2012-10-18 18:00:00      18.4      12.0
    2012-10-18 19:00:00      18.2      12.0
    2012-10-18 20:00:00      19.2      12.0

How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just
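The question (and any answer) is cut off above; one common approach, sketched here rather than taken from the thread, is to group the transposed frame by column name, sum, and transpose back. The transpose form is used because `groupby(..., axis=1)` is deprecated in recent pandas.

```python
import pandas as pd

# duplicate column labels, as in the question
df = pd.DataFrame([[5.6, 2.8], [18.6, 12.0], [18.4, 12.0]],
                  columns=['NY-WEB01', 'NY-WEB01'])

# transpose so the duplicate labels become the index, group rows that
# share a label, sum them, then transpose back
combined = df.T.groupby(level=0).sum().T
print(combined)
```

This collapses every set of same-named columns in one pass, not just NY-WEB01.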

What is the equivalent of SQL “GROUP BY HAVING” on Pandas?

Submitted by ε祈祈猫儿з on 2019-11-27 05:42:47

Question: What would be the most efficient way to use groupby and, in parallel, apply a filter in pandas? Basically I am asking for the pandas equivalent of this SQL:

    select * ... group by col_name having condition

I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful. I need very good performance, so ideally such a command would not be the result of several layered operations done in Python.

Answer 1: As mentioned in
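The built-in counterpart of HAVING is GroupBy.filter, which keeps or drops whole groups based on a per-group condition. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col_name': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 2, 3, 4, 5]})

# SQL: SELECT * FROM df GROUP BY col_name HAVING SUM(value) > 5
# group 'a' sums to 3 (dropped), group 'b' sums to 12 (kept)
kept = df.groupby('col_name').filter(lambda g: g['value'].sum() > 5)
print(kept)
```

Unlike SQL's HAVING, filter returns the original rows of the surviving groups rather than one aggregated row per group; follow it with an agg step if you also need the aggregate.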

group by pandas dataframe and select latest in each group

Submitted by 不打扰是莪最后的温柔 on 2019-11-27 04:12:18

Question: How can I group the values of a pandas dataframe and select the latest (by date) row from each group? For example, given a dataframe sorted by date:

       id  product        date
    0  220     6647  2014-09-01
    1  220     6647  2014-09-03
    2  220     6647  2014-10-16
    3  826     3380  2014-11-11
    4  826     3380  2014-12-09
    5  826     3380  2015-05-19
    6  901     4555  2014-09-01
    7  901     4555  2014-10-05
    8  901     4555  2014-11-01

grouping by id or product and selecting the latest should give:

       id  product        date
    2  220     6647  2014-10-16
    5  826     3380  2015-05-19
    8  901     4555  2014-11-01

Answer 1: Use idxmax inside the groupby and slice df with loc:

    df.loc[df.groupby('id').date.idxmax()]

       id  product        date
    2  220
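A self-contained version of that approach, with the date column parsed so idxmax compares real timestamps rather than strings:

```python
import pandas as pd

df = pd.DataFrame({
    'id':      [220, 220, 220, 826, 826, 826],
    'product': [6647, 6647, 6647, 3380, 3380, 3380],
    'date': pd.to_datetime(['2014-09-01', '2014-09-03', '2014-10-16',
                            '2014-11-11', '2014-12-09', '2015-05-19']),
})

# idxmax returns the row label of the latest date within each id;
# loc then pulls those full rows back out of the frame
latest = df.loc[df.groupby('id')['date'].idxmax()]
print(latest)
```

For the earliest row per group, swap idxmax for idxmin.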

How to summarize on different groupby combinations?

Submitted by 自作多情 on 2019-11-27 04:07:42

Question: I am compiling a table of the top-3 crops by county. Some counties have the same crop varieties in the same order; other counties have the same varieties in a different order.

    df1 = pd.DataFrame({
        "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
        "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
        "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
        "Crop3": ["apples", "apples",
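The question is truncated above, but the usual trick for grouping on an order-insensitive combination is to build a frozenset key from the crop columns. A sketch using a subset of the question's rows (the Crop3 values for Harney and Baker are visible in the truncated snippet; the Hood River value is a hypothetical completion):

```python
import pandas as pd

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Hood River"],
    "Crop1":  ["grain", "melons", "apples"],
    "Crop2":  ["melons", "grain", "melons"],
    "Crop3":  ["apples", "apples", "pears"],
})

# frozenset ignores ordering, so {grain, melons, apples} matches
# regardless of which column each crop sits in
key = df1[["Crop1", "Crop2", "Crop3"]].apply(frozenset, axis=1)
combos = df1.groupby(key)["County"].agg(list)
print(combos)
```

Harney and Baker share the same three crops in different column order, so they land in one group; Hood River forms its own.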

What is the difference between pandas agg and apply function?

Submitted by 我怕爱的太早我们不能终老 on 2019-11-27 04:04:31

Question: I can't figure out the difference between pandas' .aggregate and .apply functions. Take the following as an example: I load a dataset, do a groupby, define a simple function, and then use either .agg or .apply. As you can see, the print statements inside my function produce the same output whether I use .agg or .apply, yet the results differ. Why is that?

    import pandas as pd

    iris = pd.read_csv('iris.csv')
    by_species = iris.groupby('Species')

    def f(x):
        print(type(x))
        print(x.head(3))
        return 1

Using apply:

    by_species.apply(f)
    # <class 'pandas.core
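The short answer: apply calls f once per group, passing the whole sub-DataFrame; agg calls f once per column of each group. A minimal sketch with inline data standing in for iris.csv:

```python
import pandas as pd

df = pd.DataFrame({'Species': ['setosa', 'setosa', 'virginica'],
                   'SepalLength': [5.1, 4.9, 6.3],
                   'SepalWidth': [3.5, 3.0, 3.3]})
by_species = df.groupby('Species')

def f(x):
    return 1

# apply: one call per group -> a Series with one value per species
applied = by_species.apply(f)

# agg: one call per column of each group -> a DataFrame of 1s,
# one column per aggregated column
agged = by_species.agg(f)
print(applied)
print(agged)
```

That is why the shapes differ: apply produces a single result per group, while agg produces a result per (group, column) pair.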

Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)

Submitted by 大城市里の小女人 on 2019-11-27 02:41:44

Question: I have the following data frame and want to:

- group records by month
- sum QTY_SOLD and NET_AMT of each unique UPC_ID (per month)
- include the rest of the columns in the resulting dataframe

The way I thought I could do this: first create a month column to aggregate the D_DATEs, then sum QTY_SOLD by UPC_ID. Script:

    # Convert date to datetime object
    df['D_DATE'] = pd.to_datetime(df['D_DATE'])

    # Create aggregated months column
    df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m
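The script is cut off above. To keep every original column while attaching per-group sums, transform is the usual tool, since it returns a result aligned to the original rows. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'D_DATE': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-10']),
    'UPC_ID': [100, 100, 100],
    'QTY_SOLD': [2, 3, 4],
    'NET_AMT': [10.0, 15.0, 20.0],
})

# derive the month key with the .dt accessor (no per-row apply needed)
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

# transform keeps the original shape, so every other column survives
grp = df.groupby(['month', 'UPC_ID'])
df['QTY_SOLD_SUM'] = grp['QTY_SOLD'].transform('sum')
df['NET_AMT_SUM'] = grp['NET_AMT'].transform('sum')
print(df)
```

If one row per (month, UPC_ID) is wanted instead, follow this with drop_duplicates on those keys.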

Select the max row per group - pandas performance issue

Submitted by 我只是一个虾纸丫 on 2019-11-26 23:10:50

Question: I'm selecting one max row per group, using groupby/apply to return index values and then selecting the rows with loc. For example, to group by "Id" and select the row with the highest "delta" value:

    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    selected_rows = df.loc[selected_idx, :]

However, this is very slow: my i7/16 GB RAM laptop hangs when I run this query on 13 million rows. I have two questions for the experts: how can I make this query run fast
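The per-group Python lambda is the likely bottleneck. Two vectorised alternatives that usually scale much better are a plain idxmax, or sort_values + drop_duplicates; a sketch with fixed sample data so the result is deterministic:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2],
                   'delta': [0.5, 0.9, 0.3, 0.7]})

# vectorised idxmax: one pass, no Python-level call per group
fast_a = df.loc[df.groupby('Id')['delta'].idxmax()]

# sort once, then keep the last (largest) row of each Id
fast_b = df.sort_values('delta').drop_duplicates('Id', keep='last')
print(fast_a)
```

Both return the full row of the per-group maximum; they can differ only in row order and in which row wins when deltas tie.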

Splitting a dataframe into separate CSV files

Submitted by 落花浮王杯 on 2019-11-26 22:26:56

Question: I have a fairly large csv, looking like this:

    +---------+---------+
    | Column1 | Column2 |
    +---------+---------+
    |       1 |   93644 |
    |       2 |   63246 |
    |       3 |   47790 |
    |       3 |   39644 |
    |       3 |   32585 |
    |       1 |   19593 |
    |       1 |   12707 |
    |       2 |   53480 |
    +---------+---------+

My intent is to:

- add a new column
- insert a specific value into that column, 'NewColumnValue', on each row of the csv
- sort the file based on the value in Column1
- split the original CSV into new files based on the contents of 'Column1', removing the
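The last step is cut off above; the steps that are visible can be sketched with groupby + to_csv. The column name, file names, and output directory below are my own choices, not from the question:

```python
import pandas as pd
from pathlib import Path

df = pd.DataFrame({'Column1': [1, 2, 3, 3, 3, 1, 1, 2],
                   'Column2': [93644, 63246, 47790, 39644, 32585, 19593, 12707, 53480]})

# add the constant column and sort by Column1
df['NewColumn'] = 'NewColumnValue'
df = df.sort_values('Column1')

out = Path('split_out')
out.mkdir(exist_ok=True)

# groupby yields one sub-frame per distinct Column1 value;
# write each to its own CSV
for key, chunk in df.groupby('Column1'):
    chunk.to_csv(out / f'part_{key}.csv', index=False)
```

For a CSV too large to hold in memory, the same loop can run per chunk via pd.read_csv(..., chunksize=...), appending to the per-key files with mode='a'.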