pandas-groupby

When is it appropriate to use df.value_counts() vs df.groupby('…').count()?

Submitted by 孤街浪徒 on 2019-11-26 12:27:53
Question: I've heard that in pandas there are often multiple ways to do the same thing, but I was wondering: if I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()? Answer 1: There is a difference. value_counts returns: "The resulting object will be in descending order so that the first element is the most frequently-occurring element."
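The core difference can be sketched with a small made-up frame (colA/colB are illustrative, not from the question): value_counts returns a Series of frequencies sorted descending, while groupby().count() counts non-null entries in each remaining column.

```python
import pandas as pd

df = pd.DataFrame({"colA": ["x", "y", "x", "x", "y"],
                   "colB": [1, 2, None, 4, 5]})

# value_counts: a Series of frequencies, sorted so the most
# common value comes first
vc = df["colA"].value_counts()

# groupby().count(): a DataFrame indexed by colA, counting
# non-null entries per remaining column
gb = df.groupby("colA").count()
```

Note that count() skips NaN values, so the two can disagree when other columns contain missing data; groupby().size() counts rows regardless of NaNs.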

Pandas, groupby and count

Submitted by 狂风中的少年 on 2019-11-26 11:30:58
Question: I have a dataframe like this: >>> df = pd.DataFrame({'user_id':['a','a','s','s','s'], 'session':[4,5,4,5,5], 'revenue':[-1,0,1,2,1]}) >>> df revenue session user_id 0 -1 4 a 1 0 5 a 2 1 4 s 3 2 5 s 4 1 5 s Each value of session and revenue represents a kind of type, and I want to count the number of each kind; say, the number of rows with revenue=-1 and session=4 for user_id=a is 1. I found that simply calling count() after groupby() can't output the result I want. >>> df
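One way to get per-combination counts from the question's frame (a sketch, not necessarily the accepted answer) is to group by all three columns and use size(), which counts rows per combination:

```python
import pandas as pd

df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                   'session': [4, 5, 4, 5, 5],
                   'revenue': [-1, 0, 1, 2, 1]})

# size() counts rows per (user_id, session, revenue) combination;
# unlike count(), it does not skip rows containing NaN
counts = df.groupby(['user_id', 'session', 'revenue']).size()
```

The result is a Series with a three-level MultiIndex; reset_index(name='n') turns it back into a flat frame if needed.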

Multiple aggregations of the same column using pandas GroupBy.agg()

Submitted by 不打扰是莪最后的温柔 on 2019-11-26 11:16:22
Given the following (totally overkill) data frame example import pandas as pd import numpy as np import datetime as dt df = pd.DataFrame({ "date" : [dt.date(2012, x, 1) for x in range(1, 11)], "returns" : 0.05 * np.random.randn(10), "dummy" : np.repeat(1, 10) }) is there an existing built-in way to apply two different aggregating functions to the same column, without having to call agg multiple times? The syntactically wrong, but intuitively right, way to do it would be: # Assume `function1` and `function2` are defined for aggregating. df.groupby("dummy").agg({"returns": function1, "returns": function2}) Obviously
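The dict with duplicate keys fails because the second "returns" key overwrites the first. Passing a list of functions works, as does named aggregation (available in pandas 0.25+). A sketch using the built-in "mean" and "std" in place of the question's function1/function2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"dummy": np.repeat(1, 10),
                   "returns": 0.05 * np.random.randn(10)})

# Pass a list of functions to apply several aggregations
# to the same column in one agg() call
out = df.groupby("dummy")["returns"].agg(["mean", "std"])

# Named aggregation (pandas >= 0.25) also controls the
# output column names directly
out2 = df.groupby("dummy").agg(ret_mean=("returns", "mean"),
                               ret_std=("returns", "std"))
```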

group by pandas dataframe and select latest in each group

Submitted by 你说的曾经没有我的故事 on 2019-11-26 10:56:00
Question: How do I group values of a pandas dataframe and select the latest (by date) row from each group? For example, given a dataframe sorted by date: id product date 0 220 6647 2014-09-01 1 220 6647 2014-09-03 2 220 6647 2014-10-16 3 826 3380 2014-11-11 4 826 3380 2014-12-09 5 826 3380 2015-05-19 6 901 4555 2014-09-01 7 901 4555 2014-10-05 8 901 4555 2014-11-01 grouping by id or product, and selecting the latest gives: id product date 2 220 6647 2014-10-16 5 826 3380 2015-05-19 8 901 4555 2014-11-01 Answer 1:
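One common pattern for this (a sketch, not necessarily the accepted answer): sort by date and take the last row of each group with tail(1), which keeps all columns:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01"]),
})

# Sort so "last" means "latest", then keep one row per group
latest = df.sort_values("date").groupby("id").tail(1)
```

If the frame is already sorted by date, as in the question, the sort_values step is redundant but harmless.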

Aggregation in pandas

Submitted by 喜欢而已 on 2019-11-26 10:34:59
How to perform aggregation with pandas? No DataFrame after aggregation! What happened? How to aggregate mainly string columns (to lists, tuples, or strings with a separator)? How to aggregate counts? How to create a new column filled with aggregated values? I've seen these recurring questions asking about various facets of the pandas aggregation functionality. Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity. This Q/A is meant to be
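As a taste of the string-aggregation cases listed above, a minimal sketch (the group/word columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"],
                   "word": ["x", "y", "z"]})

# Aggregate a string column to lists per group...
as_list = df.groupby("group")["word"].agg(list)

# ...or to a single joined string with a separator
joined = df.groupby("group")["word"].agg(", ".join)
```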

why does pandas rolling use single dimension ndarray

Submitted by 孤街浪徒 on 2019-11-26 09:29:08
Question: I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to use apply after df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way. Here is what I found: import pandas as pd import numpy as np np.random.seed([3,1415]) df = pd.DataFrame(np.random.rand(5, 2).round(2
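The behavior the question ran into can be demonstrated directly: rolling().apply invokes the function once per column with a 1-D window, never with a 2-D slice of the whole frame. A sketch that records the shapes the function receives (raw=True passes plain ndarrays instead of Series):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=["a", "b"])

seen_shapes = []

def record(window):
    # window is a 1-D array: one column's values for one window
    seen_shapes.append(window.shape)
    return window.sum()

out = df.rolling(2).apply(record, raw=True)
```

Because each call only sees one column, cross-column operations such as matrix multiplication cannot be done inside rolling().apply; a manual loop over window positions (or a different tool) is needed for that.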

Groupby value counts on the dataframe pandas

Submitted by 随声附和 on 2019-11-26 08:09:33
Question: I have the following dataframe: df = pd.DataFrame([ (1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'), (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'), (2, 2, 'term1') ], columns=['id', 'group', 'term']) I want to group it by id and group and calculate the number of each term for that (id, group) pair. So in the end I am going to get something like this: I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this
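A common non-looping approach (a sketch; the expected output is a table of term counts per (id, group) pair): value_counts inside the groupby, then unstack the terms into columns:

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
    (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
    (2, 2, 'term1')], columns=['id', 'group', 'term'])

# Count each term per (id, group) pair, then pivot the terms
# into columns, filling absent combinations with 0
counts = (df.groupby(['id', 'group'])['term']
            .value_counts()
            .unstack(fill_value=0))
```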

How to move pandas data from index to column after multiple groupby

Submitted by 狂风中的少年 on 2019-11-26 06:37:53
Question: I have the following pandas dataframe: dfalph.head() token year uses books 386 xanthos 1830 3 3 387 xanthos 1840 1 1 388 xanthos 1840 2 2 389 xanthos 1868 2 2 390 xanthos 1875 1 1 I aggregate the rows with duplicate token and year like so: dfalph = dfalph[['token','year','uses','books']].groupby(['token', 'year']).agg([np.sum]) dfalph.columns = dfalph.columns.droplevel(1) dfalph.head() uses books token year xanthos 1830 3 3 1840 3 3 1867 2 2 1868 2 2 1875 1 1 Instead of having
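One way to keep token and year as regular columns instead of index levels is as_index=False, which avoids both the MultiIndex and the droplevel step; a sketch using a reduced version of the data:

```python
import pandas as pd

df = pd.DataFrame({"token": ["xanthos"] * 4,
                   "year": [1830, 1840, 1840, 1868],
                   "uses": [3, 1, 2, 2],
                   "books": [3, 1, 2, 2]})

# as_index=False keeps the grouping keys as ordinary columns,
# so no reset_index() or droplevel() is needed afterwards
out = df.groupby(["token", "year"], as_index=False)[["uses", "books"]].sum()
```

An equivalent after the fact is dfalph.reset_index(), which moves existing index levels back into columns.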

Keep other columns when doing groupby

Submitted by 我与影子孤独终老i on 2019-11-26 03:22:52
Question: I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this: df1 = df.groupby("item", as_index=False)["diff"].min() However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows? My data looks like: item diff otherstuff 0 1 2 1 1 1 1 2 2 1 3 7 3 2 -1 0 4 2 1 3 5 2 4 9 6
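A common way to keep all columns (a sketch of the transform idiom, not necessarily the accepted answer): broadcast each group's minimum back over its rows and filter with a boolean mask, so no columns are lost. The frame below reconstructs the first six rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame({"item": [1, 1, 1, 2, 2, 2],
                   "diff": [2, 1, 3, -1, 1, 4],
                   "otherstuff": [1, 2, 7, 0, 3, 9]})

# transform("min") returns a Series aligned to the original rows,
# repeating each group's minimum, so it can be compared row-wise
df1 = df[df["diff"] == df.groupby("item")["diff"].transform("min")]
```

Note that ties keep every row achieving the minimum; use idxmin via df.loc[df.groupby("item")["diff"].idxmin()] to keep exactly one row per group.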

Count unique values with pandas per groups [duplicate]

Submitted by 拥有回忆 on 2019-11-26 03:03:49
Question: This question already has an answer here: Pandas count(distinct) equivalent (6 answers) I need to count unique ID values in every domain. I have data ID, domain 123, 'vk.com' 123, 'vk.com' 123, 'twitter.com' 456, 'vk.com' 456, 'facebook.com' 456, 'vk.com' 456, 'google.com' 789, 'twitter.com' 789, 'vk.com' I tried df.groupby(['domain', 'ID']).count() but I want to get domain, count vk.com 3 twitter.com 2 facebook.com 1 google.com 1 Answer 1: You need nunique: df = df.groupby(
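The nunique answer can be sketched end-to-end with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ['vk.com', 'vk.com', 'twitter.com', 'vk.com',
               'facebook.com', 'vk.com', 'google.com',
               'twitter.com', 'vk.com'],
})

# nunique counts distinct IDs per domain; count() would count
# rows, so repeated (ID, domain) pairs would inflate the result
out = df.groupby('domain')['ID'].nunique()
```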