pandas-groupby

Convert pandas.groupby to dict

Submitted by 别等时光非礼了梦想 on 2019-11-26 22:05:12
Question: Consider dataframe d:

    d = pd.DataFrame({'a': [0, 2, 1, 1, 1, 1, 1],
                      'b': [2, 1, 0, 1, 0, 0, 2],
                      'c': [1, 0, 2, 1, 0, 2, 2]})

       a  b  c
    0  0  2  1
    1  2  1  0
    2  1  0  2
    3  1  1  1
    4  1  0  0
    5  1  0  2
    6  1  2  2

I want to split it by column a into a dictionary like this:

    {0:    a  b  c
        0  0  2  1,
     1:    a  b  c
        2  1  0  2
        3  1  1  1
        4  1  0  0
        5  1  0  2
        6  1  2  2,
     2:    a  b  c
        1  2  1  0}

The solution I've found using pandas.groupby is:

    {k: table for k, table in d.groupby("a")}

What are the other solutions?

Answer 1: You can use dict with tuple / list …
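The truncated answer presumably means something like the following sketch: since groupby yields (key, sub-DataFrame) pairs, dict() can consume them directly once they are wrapped in a tuple or list.

```python
import pandas as pd

d = pd.DataFrame({'a': [0, 2, 1, 1, 1, 1, 1],
                  'b': [2, 1, 0, 1, 0, 0, 2],
                  'c': [1, 0, 2, 1, 0, 2, 2]})

# groupby yields (key, sub-DataFrame) pairs, so dict() consumes them directly
groups = dict(tuple(d.groupby('a')))
```

Each value in `groups` is the sub-DataFrame for that key, with the original row index preserved.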

get first and last values in a groupby

Submitted by 浪子不回头ぞ on 2019-11-26 21:14:01
Question: I have a dataframe df:

    df = pd.DataFrame(np.arange(20).reshape(10, -1),
                      [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                       ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                      ['X', 'Y'])

How do I get the first and last rows, grouped by the first level of the index? I tried

    df.groupby(level=0).agg(['first', 'last']).stack()

and got

              X   Y
    a first   0   1
      last    6   7
    b first   8   9
      last   12  13
    c first  14  15
      last   16  17
    d first  18  19
      last   18  19

This is so close to what I want. How can I preserve the level 1 …

Pandas - dataframe groupby - how to get sum of multiple columns

Submitted by Deadly on 2019-11-26 20:38:00
Question: This should be an easy one, but somehow I couldn't find a solution that works. I have a pandas dataframe which looks like this:

    index col1 col2 col3 col4 col5
    0     a    c    1    2    f
    1     a    c    1    2    f
    2     a    d    1    2    f
    3     b    d    1    2    g
    4     b    e    1    2    g
    5     b    e    1    2    g

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped, since that data cannot be aggregated. Here is how the output should look. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter …
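A sketch of the usual approach: pass both key columns to groupby and select the two numeric columns before summing, so col5 drops out automatically.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': ['c', 'c', 'd', 'd', 'e', 'e'],
                   'col3': [1, 1, 1, 1, 1, 1],
                   'col4': [2, 2, 2, 2, 2, 2],
                   'col5': ['f', 'f', 'f', 'g', 'g', 'g']})

# as_index=False keeps col1/col2 as regular columns in the result
out = df.groupby(['col1', 'col2'], as_index=False)[['col3', 'col4']].sum()
```

With the sample data this yields one row per (col1, col2) pair, e.g. ('a', 'c') has col3 = 2 and col4 = 4.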

pandas dataframe groupby datetime month

Submitted by 为君一笑 on 2019-11-26 19:41:35
Consider a csv file:

    string,date,number
    a string,2/5/11 9:16am,1.0
    a string,3/5/11 10:44pm,2.0
    a string,4/22/11 12:07pm,3.0
    a string,4/22/11 12:10pm,4.0
    a string,4/29/11 11:59am,1.0
    a string,5/2/11 1:41pm,2.0
    a string,5/2/11 2:02pm,3.0
    a string,5/2/11 2:56pm,4.0
    a string,5/2/11 3:00pm,5.0
    a string,5/2/14 3:02pm,6.0
    a string,5/2/14 3:18pm,7.0

I can read this in and reformat the date column into datetime format:

    b = pd.read_csv('b.dat')
    b['date'] = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')

I have been trying to group the data by month. It seems like there should be an obvious way of …
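One common way to finish this (my own sketch, since the question is cut off before any answer) is to group on a monthly period derived from the datetime column:

```python
import io
import pandas as pd

csv = """string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
"""

b = pd.read_csv(io.StringIO(csv))
b['date'] = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')

# dt.to_period('M') collapses each timestamp to its calendar month,
# so all rows from the same month land in the same group
monthly = b.groupby(b['date'].dt.to_period('M'))['number'].sum()
```

`pd.Grouper(key='date', freq='M')` inside groupby is an equivalent spelling.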

pandas group by year, rank by sales column, in a dataframe with duplicate data

Submitted by ぐ巨炮叔叔 on 2019-11-26 19:38:25
Question: I would like to create a rank within each year (so in year 2012, Manager B is 1; in 2011, Manager B is 1 again). I struggled with the pandas rank function for a while and DO NOT want to resort to a for loop.

    s = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8],
                      ['2011', 'A', 20], ['2011', 'B', 30]],
                     columns=['Year', 'Manager', 'Return'])

    Out[1]:
       Year Manager  Return
    0  2012       A       3
    1  2012       B       8
    2  2011       A      20
    3  2011       B      30

The issue I'm having is with the additional code (didn't think this would be relevant before): s = pd…
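The loop-free idiom for this is groupby plus rank on the Return column with ascending=False; a minimal sketch on the sample data:

```python
import pandas as pd

s = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8],
                  ['2011', 'A', 20], ['2011', 'B', 30]],
                 columns=['Year', 'Manager', 'Return'])

# rank within each Year; the largest Return gets rank 1
s['Rank'] = s.groupby('Year')['Return'].rank(ascending=False).astype(int)
```

This gives Manager B rank 1 in both 2012 (Return 8) and 2011 (Return 30), as the question asks.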

How to move pandas data from index to column after multiple groupby

Submitted by 被刻印的时光 ゝ on 2019-11-26 18:53:39
I have the following pandas dataframe:

    dfalph.head()

         token    year  uses  books
    386  xanthos  1830     3      3
    387  xanthos  1840     1      1
    388  xanthos  1840     2      2
    389  xanthos  1868     2      2
    390  xanthos  1875     1      1

I aggregate the rows with duplicate token and year like so:

    dfalph = dfalph[['token', 'year', 'uses', 'books']].groupby(['token', 'year']).agg([np.sum])
    dfalph.columns = dfalph.columns.droplevel(1)

    dfalph.head()

                   uses  books
    token   year
    xanthos 1830      3      3
            1840      3      3
            1868      2      2
            1875      1      1

Instead of having the 'token' and 'year' fields in the index, I would like to return them to columns and have an integer index. Method …
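A sketch of one way to get a flat result: either call reset_index() on the aggregated frame afterwards, or pass as_index=False up front so the keys never enter the index.

```python
import pandas as pd

dfalph = pd.DataFrame({'token': ['xanthos'] * 5,
                       'year': [1830, 1840, 1840, 1868, 1875],
                       'uses': [3, 1, 2, 2, 1],
                       'books': [3, 1, 2, 2, 1]})

# as_index=False keeps token/year as columns, giving a plain integer index
out = dfalph.groupby(['token', 'year'], as_index=False)[['uses', 'books']].sum()
```

`dfalph.groupby(['token', 'year']).sum().reset_index()` produces the same frame in two steps.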

Best Way to add group totals to a dataframe in Pandas

Submitted by 。_饼干妹妹 on 2019-11-26 18:29:14
Question: I have a simple task that I'm wondering if there is a better / more efficient way to do. I have a dataframe that looks like this:

      Group  Score  Count
    0     A      5    100
    1     A      1     50
    2     A      3      5
    3     B      1     40
    4     B      2     20
    5     B      1     60

And I want to add a column that holds the value of the group's total count:

      Group  Score  Count  TotalCount
    0     A      5    100         155
    1     A      1     50         155
    2     A      3      5         155
    3     B      1     40         120
    4     B      2     20         120
    5     B      1     60         120

The way I did this was:

    Grouped = df.groupby('Group')['Count'].sum().reset_index()
    Grouped = Grouped.rename(columns={…
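The usual shortcut here is transform('sum'), which broadcasts each group's total back onto the group's rows, so the rename-and-merge dance in the question is not needed; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Score': [5, 1, 3, 1, 2, 1],
                   'Count': [100, 50, 5, 40, 20, 60]})

# transform returns a Series aligned with df's index, so it can be
# assigned directly as a new column without any merge
df['TotalCount'] = df.groupby('Group')['Count'].transform('sum')
```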

Python Pandas Conditional Sum with Groupby

Submitted by 孤街浪徒 on 2019-11-26 17:45:20
Question: Using sample data:

    df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                       'key2': ['one', 'two', 'one', 'two', 'one'],
                       'data1': np.random.randn(5),
                       'data2': np.random.randn(5)})

          data1     data2 key1 key2
    0  0.361601  0.375297    a  one
    1  0.069889  0.809772    a  two
    2  1.468194  0.272929    b  one
    3 -1.138458  0.865060    b  two
    4 -0.268210  1.250340    a  one

I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'. Here's what I've tried:

    def f(d, a, b):
        d.ix[d[a] == b, …
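One way to do the conditional sum without a helper function (and without the long-deprecated .ix) is to mask data1 first and then group; fixed values stand in for the random data in the question so the result is checkable:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [0.5, 1.0, 2.0, -1.0, 0.25],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5]})

# where() turns rows with key2 != 'one' into NaN, which sum() then ignores
out = df['data1'].where(df['key2'] == 'one').groupby(df['key1']).sum()
```

For key1 == 'a' only rows 0 and 4 qualify (0.5 + 0.25), and for 'b' only row 2 (2.0).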

pandas groupby where you get the max of one column and the min of another column

Submitted by 僤鯓⒐⒋嵵緔 on 2019-11-26 17:14:58
Question: I have a dataframe as follows:

    user  num1  num2
    a     1     1
    a     2     2
    a     3     3
    b     4     4
    b     5     5

I want a dataframe which has the minimum of num1 for each user and the maximum of num2 for each user. The output should be like:

    user  num1  num2
    a     1     3
    b     4     5

I know that if I wanted the max of both columns I could just do:

    a.groupby('user')['num1', 'num2'].max()

Is there some equivalent without having to do something like:

    series_1 = a.groupby('user')['num1'].min()
    series_2 = a.groupby('user')['num2'].max()
    # …
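A single-pass sketch using agg with a per-column mapping, which avoids building two separate series:

```python
import pandas as pd

a = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b'],
                  'num1': [1, 2, 3, 4, 5],
                  'num2': [1, 2, 3, 4, 5]})

# agg accepts a dict mapping each column to its own reduction
out = a.groupby('user').agg({'num1': 'min', 'num2': 'max'})
```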

Count unique values with pandas per groups [duplicate]

Submitted by 假装没事ソ on 2019-11-26 16:59:25
This question already has an answer here: Pandas count(distinct) equivalent (5 answers)

I need to count unique ID values in every domain. I have data:

    ID, domain
    123, 'vk.com'
    123, 'vk.com'
    123, 'twitter.com'
    456, 'vk.com'
    456, 'facebook.com'
    456, 'vk.com'
    456, 'google.com'
    789, 'twitter.com'
    789, 'vk.com'

I tried

    df.groupby(['domain', 'ID']).count()

but I want to get

    domain        count
    vk.com            3
    twitter.com       2
    facebook.com      1
    google.com        1

jezrael answered: You need nunique:

    df = df.groupby('domain')['ID'].nunique()
    print(df)

    domain
    'facebook.com'    1
    'google.com'      1
    'twitter.com'     2
    'vk.com'          3
    Name: ID, dtype: int64
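The accepted nunique answer runs as follows, reconstructing the data from the question (quotes stripped from the domain strings for readability):

```python
import pandas as pd

df = pd.DataFrame({'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
                   'domain': ['vk.com', 'vk.com', 'twitter.com', 'vk.com',
                              'facebook.com', 'vk.com', 'google.com',
                              'twitter.com', 'vk.com']})

# nunique counts distinct IDs per domain, ignoring repeats of the same ID
counts = df.groupby('domain')['ID'].nunique()
```

Unlike count(), nunique() collapses the duplicated (123, 'vk.com') rows to a single distinct ID.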