pandas-groupby

Pandas groupby/apply has different behaviour with int and string types

对着背影说爱祢 submitted on 2020-01-01 17:09:15
Question: I have the following dataframe

       X    Y
    0  A   10
    1  A    9
    2  A    8
    3  A    5
    4  B  100
    5  B   90
    6  B   80
    7  B   50

and two different functions that are very similar:

    def func1(x):
        if x.iloc[0]['X'] == 'A':
            x['D'] = 1
        else:
            x['D'] = 0
        return x[['X', 'D']]

    def func2(x):
        if x.iloc[0]['X'] == 'A':
            x['D'] = 'u'
        else:
            x['D'] = 'v'
        return x[['X', 'D']]

Now I can groupby/apply these functions:

    df.groupby('X').apply(func1)
    df.groupby('X').apply(func2)

The first line gives me what I want, i.e.

       X  D
    0  A  1
    1  A  1
    2  A  1
    3  A  1
    4  B  0
    5  …
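Broadly, groupby.apply has to infer how to stitch the per-group results back together, and that inference can come out differently depending on the dtypes the function returns. As a practical workaround, here is a minimal sketch of a vectorized alternative that avoids apply entirely; the np.where construction is a suggestion, not code from the question, and the frame is rebuilt by hand from the sample:

    import numpy as np
    import pandas as pd

    # Rebuild the sample frame from the question.
    df = pd.DataFrame({'X': list('AAAABBBB'),
                       'Y': [10, 9, 8, 5, 100, 90, 80, 50]})

    # Vectorized equivalent of func1/func2: one branch per row, no groupby,
    # so no apply-time inference about the returned frames.
    df['D'] = np.where(df['X'] == 'A', 1, 0)
    print(df[['X', 'D']])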

Pandas Reindex to Fill Missing Dates, or Better Method to Fill?

余生长醉 submitted on 2019-12-31 01:40:27
Question: My data is absence records from a factory. Some days there are no absences, so no data or date is recorded for that day. However, and this is where it gets hairy compared with the other examples shown, on any given day there can be several absences for various reasons; there is not always a 1-to-1 ratio of date to record in the data. The result I'm hoping for is something like this:

    (index)    Shift  Description       Instances (SUM)
    01-01-14   2nd    Baker Discipline  0
    01-01-14   2nd    Baker Vacation    0
    01-01-14   1st    …
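One common pattern for this, sketched below on invented data (the column names and date window are assumptions): aggregate first, then reindex against the full cross-product of calendar days and the other keys, so absence-free combinations show up as explicit zeros.

    import pandas as pd

    # Invented absence records; real column names may differ.
    absences = pd.DataFrame({
        'date': pd.to_datetime(['2014-01-02', '2014-01-02', '2014-01-05']),
        'shift': ['2nd', '1st', '2nd'],
        'description': ['Discipline', 'Vacation', 'Vacation'],
        'instances': [1, 2, 1],
    })

    # Sum the recorded absences, then reindex over every combination of
    # calendar day x shift x description so missing days become zeros.
    totals = absences.groupby(['date', 'shift', 'description'])['instances'].sum()
    full = pd.MultiIndex.from_product(
        [pd.date_range('2014-01-01', '2014-01-07', freq='D'),
         absences['shift'].unique(),
         absences['description'].unique()],
        names=['date', 'shift', 'description'])
    result = totals.reindex(full, fill_value=0).reset_index()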

Pandas groupby dropping columns

混江龙づ霸主 submitted on 2019-12-30 08:10:26
Question: I'm doing a simple groupby operation, trying to compare group means. As you can see below, I have selected specific columns from a larger dataframe, from which all missing values have been removed. But when I group by, I am losing a couple of columns. I have never encountered this with pandas, and I'm not finding anything similar on Stack Overflow. Does anybody have any insight?

Answer 1: I think it is the automatic exclusion of "nuisance" columns, as described here. Sample:
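A minimal reproduction of the "nuisance column" behaviour on invented data (note the behaviour varies by pandas version: older releases silently drop non-numeric columns, while pandas 2.x raises unless told to restrict itself to numeric ones):

    import pandas as pd

    df = pd.DataFrame({
        'group': ['a', 'a', 'b'],
        'value': [1.0, 2.0, 3.0],
        'label': ['x', 'y', 'z'],   # object dtype: cannot be averaged
    })

    # 'label' is excluded from the result; numeric_only makes that explicit.
    print(df.groupby('group').mean(numeric_only=True))

    # Selecting the wanted columns up front is the unambiguous spelling.
    print(df.groupby('group')[['value']].mean())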

Pandas groupby with categories with redundant nan

一笑奈何 submitted on 2019-12-29 20:17:05
Question: I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for. I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each…
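A small sketch of the fix usually suggested for this, on toy data: passing observed=True to groupby keeps only the category combinations that actually occur, instead of the full Cartesian product of declared categories.

    import pandas as pd

    df = pd.DataFrame({
        'a': pd.Categorical(['x', 'x', 'y'], categories=['x', 'y', 'z']),
        'b': pd.Categorical(['p', 'q', 'p'], categories=['p', 'q']),
        'v': [1, 2, 3],
    })

    # Default: one row per combination of declared categories (3 x 2 = 6 rows),
    # with filler values for pairs that never appear in the data.
    print(df.groupby(['a', 'b'], observed=False)['v'].sum())

    # observed=True: only the combinations actually present (3 rows here).
    print(df.groupby(['a', 'b'], observed=True)['v'].sum())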

Group data by season according to the exact dates

笑着哭i submitted on 2019-12-29 07:59:09
Question: I have a CSV file containing 4 years of data, and I am trying to group the data per season over those 4 years. Put differently, I need to summarize and plot my whole dataset into just 4 seasons. Here's a look at my data file:

    timestamp,heure,lat,lon,impact,type
    2006-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
    2006-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
    2007-02-01 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
    2007-02-02 00:00:00,01:14:29,36.5685,0.9043,36.8,1
    2008-01-01 00:00:00,05:03:51,34…
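A sketch of one way to do this; the file name is hypothetical, and the month-based season cutoffs are an assumption (exact solstice/equinox dates would need a day-level comparison instead):

    import pandas as pd

    df = pd.read_csv('data.csv', parse_dates=['timestamp'])

    # Map each month to a season; adjust if exact astronomical dates are needed.
    season_of = {12: 'winter', 1: 'winter', 2: 'winter',
                 3: 'spring', 4: 'spring', 5: 'spring',
                 6: 'summer', 7: 'summer', 8: 'summer',
                 9: 'autumn', 10: 'autumn', 11: 'autumn'}
    df['season'] = df['timestamp'].dt.month.map(season_of)

    # Collapse the four years into four seasonal summaries and plot them.
    df.groupby('season')['impact'].mean().plot(kind='bar')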

What is the difference between pandas agg and apply function?

浪尽此生 submitted on 2019-12-27 22:16:28
Question: I can't figure out the difference between pandas' .aggregate and .apply functions. Take the following as an example: I load a dataset, do a groupby, define a simple function, and use either .agg or .apply. As you can see, the print statement within my function produces the same output whether I use .agg or .apply; the result, on the other hand, is different. Why is that?

    import pandas as pd

    iris = pd.read_csv('iris.csv')
    by_species = iris.groupby('Species')

    def f(x):
        …
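The short version, sketched on toy data below rather than the question's iris dataset: agg feeds the function one column (a Series) at a time, while apply hands it each group as a whole DataFrame, so the same function can see different inputs and produce differently shaped results.

    import pandas as pd

    df = pd.DataFrame({'g': ['a', 'a', 'b'],
                       'x': [1.0, 2.0, 3.0],
                       'y': [10.0, 20.0, 30.0]})
    grouped = df.groupby('g')

    # agg: the callable receives each column of each group as a Series.
    print(grouped.agg(lambda s: s.max() - s.min()))

    # apply: the callable receives each group as a full DataFrame.
    print(grouped.apply(lambda g: g['x'].max() - g['x'].min()))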

How to pivot a dataframe

爷,独闯天下 submitted on 2019-12-25 11:45:58
Question: What is pivot? How do I pivot? Is this a pivot? Long format to wide format? I've seen a lot of questions that ask about pivot tables. Even if the askers don't know they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting... but I'm going to give it a go. The problem with existing questions and answers is that the question is often focused on a nuance that the OP has trouble…
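As a taste of what the canonical answer covers, here is a minimal long-to-wide sketch on invented data, contrasting pivot with pivot_table:

    import pandas as pd

    long = pd.DataFrame({'row': ['r0', 'r0', 'r1'],
                         'col': ['c0', 'c1', 'c0'],
                         'val': [1, 2, 3]})

    # pivot: pure reshape; duplicate (row, col) pairs raise an error.
    wide = long.pivot(index='row', columns='col', values='val')

    # pivot_table: reshape plus aggregation of any duplicates.
    wide_summed = long.pivot_table(index='row', columns='col',
                                   values='val', aggfunc='sum')
    print(wide)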

Pandas - Expanding Z-Score Across Multiple Columns

我与影子孤独终老i submitted on 2019-12-25 08:59:18
Question: I want to calculate an expanding z-score for some time series data I have in a DataFrame, but I want to standardize the data using the mean and standard deviation of multiple columns, rather than the mean and standard deviation within each column separately. I believe I want some combination of groupby and DataFrame.expanding, but I can't seem to figure it out. Here's some example data:

    import pandas as pd
    import numpy as np

    np.random.seed(42)
    df = pd.DataFrame(np.random.rand…
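One way to sketch this (the data shape is invented, since the question's example is truncated, and the population variance with ddof=0 is an assumption): keep running sums pooled across all columns, then standardize each column against the shared expanding statistics.

    import numpy as np
    import pandas as pd

    np.random.seed(42)
    df = pd.DataFrame(np.random.rand(100, 3), columns=list('abc'))

    # Pooled running sums: every row contributes all of its columns.
    n_cols = df.shape[1]
    cum_sum = df.sum(axis=1).cumsum()
    cum_sq = (df ** 2).sum(axis=1).cumsum()
    count = n_cols * np.arange(1, len(df) + 1)

    mean = cum_sum / count             # expanding pooled mean
    var = cum_sq / count - mean ** 2   # expanding pooled variance (ddof=0)
    # Early rows are standardized against very few observations.
    z = df.sub(mean, axis=0).div(np.sqrt(var), axis=0)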

Pandas Groupby using time frequency

痞子三分冷 submitted on 2019-12-25 04:09:04
Question: My question is regarding a groupby on a pandas dataframe. A sample dataset would look like this:

    cust_id | date       | category
    A0001   | 20/02/2016 | cat1
    A0001   | 24/02/2016 | cat2
    A0001   | 02/03/2016 | cat3
    A0002   | 03/04/2015 | cat2

Now I want to group by cust_id, find events that occur within 30 days of each other, and compile the list of categories for those. What I have figured out so far is to use pd.Grouper in the following manner:

    df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])[…
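For reference, a runnable version of that starting point; the list aggregation is a guess at the desired output, and note that freq='30D' cuts fixed 30-day bins from the first timestamp, which is not quite the same as "within 30 days of each other":

    import pandas as pd

    df = pd.DataFrame({
        'cust_id': ['A0001', 'A0001', 'A0001', 'A0002'],
        'date': pd.to_datetime(['20/02/2016', '24/02/2016',
                                '02/03/2016', '03/04/2015'],
                               format='%d/%m/%Y'),
        'category': ['cat1', 'cat2', 'cat3', 'cat2'],
    })

    # Fixed 30-day bins per customer; two events straddling a bin edge land in
    # different groups even when they are fewer than 30 days apart.
    out = (df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])
             ['category'].apply(list))
    print(out)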

Why didn't my code select data from a Pandas dataframe? [duplicate]

余生长醉 submitted on 2019-12-25 03:45:07
Question: This question already has answers here:
How to filter by month, day, year with Pandas (1 answer)
Keep only date part when using pandas.to_datetime (8 answers)
Closed last year.

Why didn't my date filter work? All other filters work fine.

    import pandas as pd
    import datetime

    data = pd.DataFrame({
        'country': ['USA', 'USA', 'Belarus', 'Brazil'],
        'time': ['2018-01-15 16:11:45.923570+00:00',
                 '2018-01-15 16:19:45.923570+00:00',
                 '2018-01-16 16:12:45.923570+00:00',
                 '2018-01-17 16:14:45.923570+00:00']…
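The usual failure mode behind the linked duplicates: the column holds full (here tz-aware) timestamps, so comparing it with == against a bare calendar date rarely matches anything. A sketch of the fix, reusing the question's data:

    import datetime
    import pandas as pd

    data = pd.DataFrame({
        'country': ['USA', 'USA', 'Belarus', 'Brazil'],
        'time': ['2018-01-15 16:11:45.923570+00:00',
                 '2018-01-15 16:19:45.923570+00:00',
                 '2018-01-16 16:12:45.923570+00:00',
                 '2018-01-17 16:14:45.923570+00:00'],
    })

    # Parse the strings into real timestamps, then compare only the date part.
    data['time'] = pd.to_datetime(data['time'])
    mask = data['time'].dt.date == datetime.date(2018, 1, 15)
    print(data[mask])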