pandas-groupby

Pandas groupby/apply has different behaviour with int and string types

对着背影说爱祢 submitted on 2020-01-01 17:09:15
Question: I have the following dataframe

       X    Y
    0  A   10
    1  A    9
    2  A    8
    3  A    5
    4  B  100
    5  B   90
    6  B   80
    7  B   50

and two different functions that are very similar:

    def func1(x):
        if x.iloc[0]['X'] == 'A':
            x['D'] = 1
        else:
            x['D'] = 0
        return x[['X', 'D']]

    def func2(x):
        if x.iloc[0]['X'] == 'A':
            x['D'] = 'u'
        else:
            x['D'] = 'v'
        return x[['X', 'D']]

Now I can groupby/apply these functions:

    df.groupby('X').apply(func1)
    df.groupby('X').apply(func2)

The first line gives me what I want, i.e.

       X  D
    0  A  1
    1  A  1
    2  A  1
    3  A  1
    4  B  0
    5  …
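Broadly, groupby.apply has to infer how to stitch the per-group results back together, and that inference can come out differently depending on the dtypes the function returns. As a practical workaround, here is a minimal sketch of a vectorized alternative that avoids apply entirely; the np.where construction is a suggestion, not code from the question, and the frame is rebuilt by hand from the sample:

    import numpy as np
    import pandas as pd

    # Rebuild the sample frame from the question.
    df = pd.DataFrame({'X': list('AAAABBBB'),
                       'Y': [10, 9, 8, 5, 100, 90, 80, 50]})

    # Vectorized equivalent of func1/func2: one branch per row, no groupby,
    # so no apply-time inference about the returned frames.
    df['D'] = np.where(df['X'] == 'A', 1, 0)
    print(df[['X', 'D']])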

Pandas Reindex to Fill Missing Dates, or Better Method to Fill?

余生长醉 submitted on 2019-12-31 01:40:27
Question: My data is absence records from a factory. Some days there are no absences, so no data or date is recorded for that day. However, and this is where it gets hairy compared with the other examples shown, on any given day there can be several absences for various reasons; there is not always a 1-to-1 ratio of date to record in the data. The result I'm hoping for is something like this:

    (index)    Shift  Description       Instances (SUM)
    01-01-14   2nd    Baker Discipline  0
    01-01-14   2nd    Baker Vacation    0
    01-01-14   1st    …
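One common pattern for this, sketched below on invented data (the column names and date window are assumptions): aggregate first, then reindex against the full cross-product of calendar days and the other keys, so absence-free combinations show up as explicit zeros.

    import pandas as pd

    # Invented absence records; real column names may differ.
    absences = pd.DataFrame({
        'date': pd.to_datetime(['2014-01-02', '2014-01-02', '2014-01-05']),
        'shift': ['2nd', '1st', '2nd'],
        'description': ['Discipline', 'Vacation', 'Vacation'],
        'instances': [1, 2, 1],
    })

    # Sum the recorded absences, then reindex over every combination of
    # calendar day x shift x description so missing days become zeros.
    totals = absences.groupby(['date', 'shift', 'description'])['instances'].sum()
    full = pd.MultiIndex.from_product(
        [pd.date_range('2014-01-01', '2014-01-07', freq='D'),
         absences['shift'].unique(),
         absences['description'].unique()],
        names=['date', 'shift', 'description'])
    result = totals.reindex(full, fill_value=0).reset_index()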

Pandas groupby dropping columns

混江龙づ霸主 submitted on 2019-12-30 08:10:26
Question: I'm doing a simple groupby operation, trying to compare group means. As you can see below, I have selected specific columns from a larger dataframe, from which all missing values have been removed. But when I group by, I am losing a couple of columns. I have never encountered this with pandas, and I'm not finding anything similar on Stack Overflow. Does anybody have any insight?

Answer 1: I think it is the automatic exclusion of "nuisance" columns, as described here. Sample:
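A minimal reproduction of the "nuisance column" behaviour on invented data (note the behaviour varies by pandas version: older releases silently drop non-numeric columns, while pandas 2.x raises unless told to restrict itself to numeric ones):

    import pandas as pd

    df = pd.DataFrame({
        'group': ['a', 'a', 'b'],
        'value': [1.0, 2.0, 3.0],
        'label': ['x', 'y', 'z'],   # object dtype: cannot be averaged
    })

    # 'label' is excluded from the result; numeric_only makes that explicit.
    print(df.groupby('group').mean(numeric_only=True))

    # Selecting the wanted columns up front is the unambiguous spelling.
    print(df.groupby('group')[['value']].mean())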

Pandas groupby with categories with redundant nan

一笑奈何 submitted on 2019-12-29 20:17:05
Question: I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for. I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each…
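A small sketch of the fix usually suggested for this, on toy data: passing observed=True to groupby keeps only the category combinations that actually occur, instead of the full Cartesian product of declared categories.

    import pandas as pd

    df = pd.DataFrame({
        'a': pd.Categorical(['x', 'x', 'y'], categories=['x', 'y', 'z']),
        'b': pd.Categorical(['p', 'q', 'p'], categories=['p', 'q']),
        'v': [1, 2, 3],
    })

    # Default: one row per combination of declared categories (3 x 2 = 6 rows),
    # with filler values for pairs that never appear in the data.
    print(df.groupby(['a', 'b'], observed=False)['v'].sum())

    # observed=True: only the combinations actually present (3 rows here).
    print(df.groupby(['a', 'b'], observed=True)['v'].sum())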

Group data by season according to the exact dates

笑着哭i submitted on 2019-12-29 07:59:09
Question: I have a CSV file containing 4 years of data, and I am trying to group the data per season over those 4 years. Put differently, I need to summarize and plot my whole dataset into just 4 seasons. Here's a look at my data file:

    timestamp,heure,lat,lon,impact,type
    2006-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
    2006-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
    2007-02-01 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
    2007-02-02 00:00:00,01:14:29,36.5685,0.9043,36.8,1
    2008-01-01 00:00:00,05:03:51,34…
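A sketch of one way to do this; the file name is hypothetical, and the month-based season cutoffs are an assumption (exact solstice/equinox dates would need a day-level comparison instead):

    import pandas as pd

    df = pd.read_csv('data.csv', parse_dates=['timestamp'])

    # Map each month to a season; adjust if exact astronomical dates are needed.
    season_of = {12: 'winter', 1: 'winter', 2: 'winter',
                 3: 'spring', 4: 'spring', 5: 'spring',
                 6: 'summer', 7: 'summer', 8: 'summer',
                 9: 'autumn', 10: 'autumn', 11: 'autumn'}
    df['season'] = df['timestamp'].dt.month.map(season_of)

    # Collapse the four years into four seasonal summaries and plot them.
    df.groupby('season')['impact'].mean().plot(kind='bar')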

What is the difference between pandas agg and apply function?

浪尽此生 submitted on 2019-12-27 22:16:28
Question: I can't figure out the difference between pandas' .aggregate and .apply functions. Take the following as an example: I load a dataset, do a groupby, define a simple function, and use either .agg or .apply. As you can see, the print statement within my function produces the same output whether I use .agg or .apply; the result, on the other hand, is different. Why is that?

    import pandas as pd

    iris = pd.read_csv('iris.csv')
    by_species = iris.groupby('Species')

    def f(x):
        …
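The short version, sketched on toy data below rather than the question's iris dataset: agg feeds the function one column (a Series) at a time, while apply hands it each group as a whole DataFrame, so the same function can see different inputs and produce differently shaped results.

    import pandas as pd

    df = pd.DataFrame({'g': ['a', 'a', 'b'],
                       'x': [1.0, 2.0, 3.0],
                       'y': [10.0, 20.0, 30.0]})
    grouped = df.groupby('g')

    # agg: the callable receives each column of each group as a Series.
    print(grouped.agg(lambda s: s.max() - s.min()))

    # apply: the callable receives each group as a full DataFrame.
    print(grouped.apply(lambda g: g['x'].max() - g['x'].min()))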

How to pivot a dataframe

爷,独闯天下 submitted on 2019-12-25 11:45:58
Question: What is pivot? How do I pivot? Is this a pivot? Long format to wide format? I've seen a lot of questions that ask about pivot tables. Even if the askers don't know they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting... but I'm going to give it a go. The problem with existing questions and answers is that the question is often focused on a nuance that the OP has trouble…
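As a taste of what the canonical answer covers, here is a minimal long-to-wide sketch on invented data, contrasting pivot with pivot_table:

    import pandas as pd

    long = pd.DataFrame({'row': ['r0', 'r0', 'r1'],
                         'col': ['c0', 'c1', 'c0'],
                         'val': [1, 2, 3]})

    # pivot: pure reshape; duplicate (row, col) pairs raise an error.
    wide = long.pivot(index='row', columns='col', values='val')

    # pivot_table: reshape plus aggregation of any duplicates.
    wide_summed = long.pivot_table(index='row', columns='col',
                                   values='val', aggfunc='sum')
    print(wide)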

Pandas - Expanding Z-Score Across Multiple Columns

我与影子孤独终老i submitted on 2019-12-25 08:59:18
Question: I want to calculate an expanding z-score for some time series data I have in a DataFrame, but I want to standardize the data using the mean and standard deviation of multiple columns, rather than the mean and standard deviation within each column separately. I believe I want some combination of groupby and DataFrame.expanding, but I can't seem to figure it out. Here's some example data:

    import pandas as pd
    import numpy as np

    np.random.seed(42)
    df = pd.DataFrame(np.random.rand…
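One way to sketch this (the data shape is invented, since the question's example is truncated, and the population variance with ddof=0 is an assumption): keep running sums pooled across all columns, then standardize each column against the shared expanding statistics.

    import numpy as np
    import pandas as pd

    np.random.seed(42)
    df = pd.DataFrame(np.random.rand(100, 3), columns=list('abc'))

    # Pooled running sums: every row contributes all of its columns.
    n_cols = df.shape[1]
    cum_sum = df.sum(axis=1).cumsum()
    cum_sq = (df ** 2).sum(axis=1).cumsum()
    count = n_cols * np.arange(1, len(df) + 1)

    mean = cum_sum / count             # expanding pooled mean
    var = cum_sq / count - mean ** 2   # expanding pooled variance (ddof=0)
    # Early rows are standardized against very few observations.
    z = df.sub(mean, axis=0).div(np.sqrt(var), axis=0)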

Pandas Groupby using time frequency

痞子三分冷 submitted on 2019-12-25 04:09:04
Question: My question is regarding a groupby on a pandas dataframe. A sample dataset would look like this:

    cust_id | date       | category
    A0001   | 20/02/2016 | cat1
    A0001   | 24/02/2016 | cat2
    A0001   | 02/03/2016 | cat3
    A0002   | 03/04/2015 | cat2

Now I want to group by cust_id, find events that occur within 30 days of each other, and compile the list of categories for those. What I have figured out so far is to use pd.Grouper in the following manner:

    df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])[…
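For reference, a runnable version of that starting point; the list aggregation is a guess at the desired output, and note that freq='30D' cuts fixed 30-day bins from the first timestamp, which is not quite the same as "within 30 days of each other":

    import pandas as pd

    df = pd.DataFrame({
        'cust_id': ['A0001', 'A0001', 'A0001', 'A0002'],
        'date': pd.to_datetime(['20/02/2016', '24/02/2016',
                                '02/03/2016', '03/04/2015'],
                               format='%d/%m/%Y'),
        'category': ['cat1', 'cat2', 'cat3', 'cat2'],
    })

    # Fixed 30-day bins per customer; two events straddling a bin edge land in
    # different groups even when they are fewer than 30 days apart.
    out = (df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])
             ['category'].apply(list))
    print(out)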

Why didn't my code select data from a Pandas dataframe? [duplicate]

余生长醉 submitted on 2019-12-25 03:45:07
Question: This question already has answers here:
How to filter by month, day, year with Pandas (1 answer)
Keep only date part when using pandas.to_datetime (8 answers)
Closed last year.

Why didn't my date filter work? All other filters work fine.

    import pandas as pd
    import datetime

    data = pd.DataFrame({
        'country': ['USA', 'USA', 'Belarus', 'Brazil'],
        'time': ['2018-01-15 16:11:45.923570+00:00',
                 '2018-01-15 16:19:45.923570+00:00',
                 '2018-01-16 16:12:45.923570+00:00',
                 '2018-01-17 16:14:45.923570+00:00']…
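The usual failure mode behind the linked duplicates: the column holds full (here tz-aware) timestamps, so comparing it with == against a bare calendar date rarely matches anything. A sketch of the fix, reusing the question's data:

    import datetime
    import pandas as pd

    data = pd.DataFrame({
        'country': ['USA', 'USA', 'Belarus', 'Brazil'],
        'time': ['2018-01-15 16:11:45.923570+00:00',
                 '2018-01-15 16:19:45.923570+00:00',
                 '2018-01-16 16:12:45.923570+00:00',
                 '2018-01-17 16:14:45.923570+00:00'],
    })

    # Parse the strings into real timestamps, then compare only the date part.
    data['time'] = pd.to_datetime(data['time'])
    mask = data['time'].dt.date == datetime.date(2018, 1, 15)
    print(data[mask])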