pandas-groupby

Cannot get groupby records based on their minimum value using pandas in Python

I have the following CSV:

```
id;price;editor
k1;10,00;ed1
k1;8,00;ed2
k3;10,00;ed1
k3;11,00;ed2
k2;10,50;ed1
k1;9,50;ed3
```

If I do the following:

```python
import pandas as pd

df = pd.read_csv('Testing.csv', delimiter=';')
df_reduced = df.groupby(['id', 'editor'])['price'].min()
```

instead of getting

```
k1;8,00;ed2
k2;10,50;ed1
k3;10,00;ed1
```

I get

```
k1;10,00;ed1
   8,00;ed2
   9,50;ed3
k2;10,50;ed1
k3;10,00;ed1
   11,00;ed2
```

How can I get the three ids with their minimum values? The suggested approach: group the data by id only and find the min price for each group, then index the original dataframe based on those minimum values to bring back the editor column.
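
A minimal sketch of that approach, assuming the `Testing.csv` above (note that `decimal=','` is needed so prices like `10,00` parse as numbers rather than strings):

```python
import pandas as pd

# decimal=',' parses "10,00" as 10.0 instead of a string.
df = pd.read_csv('Testing.csv', delimiter=';', decimal=',')

# Group by id only and take the label of the row holding each group's
# minimum price; indexing the original frame keeps the editor column.
idx = df.groupby('id')['price'].idxmin()
df_reduced = df.loc[idx].sort_values('id')
print(df_reduced)
```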

I applied sum() on a groupby and I want to sort the values of the last column

Given the following DataFrame:

```
user_id  product_id  amount
1        456         1
1        87          1
1        788         3
1        456         5
1        87          2
...      ...         ...
```

The first column is the ID of the customer, the second is the ID of the product he bought, and 'amount' is the quantity of that product purchased on the given day (the date is also taken into consideration). A customer can buy as many products each day as he wants. I want to calculate the total number of times each product is bought by each customer, so I applied a groupby:

```python
df.groupby(['user_id', 'product_id'], sort=True).sum()
```

Now I want to sort the summed amount within each group.
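
A sketch of one way to do that, with hypothetical data matching the shape above: flatten the grouped result back to columns, then sort by user first and by the summed amount second.

```python
import pandas as pd

# Hypothetical data shaped like the question's sample.
df = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 1],
    'product_id': [456, 87, 788, 456, 87],
    'amount':     [1, 1, 3, 5, 2],
})

totals = df.groupby(['user_id', 'product_id'], sort=True)['amount'].sum()

# Flatten the MultiIndex and sort within each user: ascending user_id,
# descending summed amount.
totals_sorted = (totals.reset_index()
                       .sort_values(['user_id', 'amount'],
                                    ascending=[True, False]))
print(totals_sorted)
```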

Pandas Reindex to Fill Missing Dates, or Better Method to Fill?

My data is absence records from a factory. Some days there are no absences, so no data or date is recorded for that day. However, and this is where it gets hairier than the other examples shown, on any given day there can be several absences for various reasons; there is not always a 1-to-1 ratio of date to record in the data. The result I'm hoping for is something like this:

```
(index)   Shift      Description  Instances (SUM)
01-01-14  2nd Baker  Discipline   0
01-01-14  2nd Baker  Vacation     0
01-01-14  1st Cooks  Discipline   0
01-01-14  1st Cooks  Vacation     0
01-02-14  2nd Baker  Discipline   4
01-02-14  2nd Baker  ...
```
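
A sketch of the reindex approach, with hypothetical column names taken from the desired output: sum the records first, then reindex onto the full cross product of calendar days and (Shift, Description) pairs with `fill_value=0`.

```python
import pandas as pd

# Hypothetical absence records shaped like the desired output above.
df = pd.DataFrame({
    'Date':        pd.to_datetime(['2014-01-02', '2014-01-02']),
    'Shift':       ['2nd Baker', '2nd Baker'],
    'Description': ['Discipline', 'Vacation'],
    'Instances':   [4, 1],
})

summed = df.groupby(['Date', 'Shift', 'Description'])['Instances'].sum()

# Full grid: every calendar day in the range crossed with every
# (Shift, Description) pair seen in the data; missing days become 0.
days = pd.date_range('2014-01-01', df['Date'].max())
combos = df[['Shift', 'Description']].drop_duplicates()
grid = pd.MultiIndex.from_tuples(
    [(d, s, desc) for d in days for s, desc in combos.itertuples(index=False)],
    names=['Date', 'Shift', 'Description'])
result = summed.reindex(grid, fill_value=0)
print(result)
```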

Filtering pandas dataframe by day

I have a pandas data frame with forex data by minute, one year long (371635 rows):

```
                           O        H        L        C
2017-01-02 02:00:00  1.05155  1.05197  1.05155  1.05190
2017-01-02 02:01:00  1.05209  1.05209  1.05177  1.05179
2017-01-02 02:02:00  1.05177  1.05198  1.05177  1.05178
2017-01-02 02:03:00  1.05188  1.05200  1.05188  1.05200
2017-01-02 02:04:00  1.05196  1.05204  1.05196  1.05203
```

I want to filter the daily data to get an hour range:

```python
from datetime import datetime

dt = datetime(2017, 1, 1)
df_day = df[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')
```

If I do this in a for loop over 200 days, it takes minutes. I suspect that at every step …
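
The likely cost is that `df.index.date == dt.date()` materializes a 371k-element array of Python date objects on every loop iteration. A sketch of a single-pass alternative, assuming `df` has a sorted DatetimeIndex as shown: slice the hour range once for the whole year, then split the result by day.

```python
# Slice the 08:30-09:30 window once over the full year ...
window = df.between_time('08:30', '09:30')

# ... then split it into per-day frames in a single pass.
per_day = {day: chunk for day, chunk in window.groupby(window.index.date)}
```

A single day can also be pulled out cheaply with partial-string indexing, e.g. `df.loc['2017-01-02']`, which uses the sorted index rather than scanning every row.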

Groupby class and count missing values in features

I have a problem and cannot find any solution on the web or in the documentation, even though I think it is very trivial. What do I want to do? I have a dataframe like this:

```
CLASS  FEATURE1  FEATURE2  FEATURE3
X      A         NaN       NaN
X      NaN       A         NaN
B      A         A         A
```

I want to group by the label (CLASS) and display the number of NaN values counted in every feature, so that it looks like this:

```
CLASS  FEATURE1  FEATURE2  FEATURE3
X      1         1         2
B      0         0         0
```

The purpose of this is to get a general idea of how missing values are distributed over the different classes. I know how to get the number of non-null values: df.groupby('CLASS').count()
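
A minimal sketch of the inverse, using the sample frame above: convert the frame to a boolean NaN mask with `isna()` and sum it per class (each `True` sums as 1).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CLASS':    ['X', 'X', 'B'],
    'FEATURE1': ['A', np.nan, 'A'],
    'FEATURE2': [np.nan, 'A', 'A'],
    'FEATURE3': [np.nan, np.nan, 'A'],
})

# isna() yields booleans; summing them per CLASS counts the NaNs.
nan_counts = df.set_index('CLASS').isna().groupby(level='CLASS').sum()
print(nan_counts)
```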

Pandas group by time with specified start time with non integer minutes

I have a dataframe with one-hour-long signals. I want to group them into 10-minute buckets. The problem is that the starting time is not precisely a "multiple" of 10 minutes; therefore, instead of obtaining 6 groups, I obtain 7, with the first and the last incomplete. The issue can easily be reproduced with:

```python
import pandas as pd
import numpy as np
import datetime as dt

rng = pd.date_range('1/1/2011 00:05:30', periods=3600, freq='1S')
ts = pd.DataFrame({'a': np.random.randn(len(rng)),
                   'b': np.random.randn(len(rng))}, index=rng)
interval = dt.timedelta(minutes=10)
ts.groupby(pd.Grouper(freq=interval))
```
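
A sketch of one fix, assuming pandas >= 1.1: `resample` (and `pd.Grouper`) accept an `origin` argument, and `origin='start'` anchors the bins at the first timestamp instead of at midnight, so the hour splits into exactly six complete buckets.

```python
# Bins start at 00:05:30, 00:15:30, ... instead of 00:00:00, 00:10:00, ...
buckets = ts.resample('10min', origin='start').mean()
print(len(buckets))  # 6 groups, each covering a full 10 minutes
```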

Deleting rows based on values in other rows

I was looking for a way to drop rows from my dataframe based on conditions checked against values in other rows. Here is my dataframe:

```
product  product_id  account_status
prod-A   100         active
prod-A   100         cancelled
prod-A   300         active
prod-A   400         cancelled
```

If a row with account_status='active' exists for a product and product_id combination, then retain this row and delete the other rows of that combination. The desired output is:

```
product  product_id  account_status
prod-A   100         active
prod-A   300         active
prod-A   400         cancelled
```

I saw the solution mentioned here but couldn't replicate it for strings. Please suggest.
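
A sketch using a grouped transform on the sample above: flag groups that contain an active row, then keep a row if it is itself active, or if its group has no active row at all.

```python
import pandas as pd

df = pd.DataFrame({
    'product':        ['prod-A', 'prod-A', 'prod-A', 'prod-A'],
    'product_id':     [100, 100, 300, 400],
    'account_status': ['active', 'cancelled', 'active', 'cancelled'],
})

is_active = df['account_status'].eq('active')
# True for every row whose (product, product_id) group has an active row.
group_has_active = is_active.groupby(
    [df['product'], df['product_id']]).transform('any')

result = df[is_active | ~group_has_active]
print(result)
```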

Pandas groupby + transform and multiple columns

To obtain results computed on groupby data at the same level of detail as the original DataFrame (the same observation count), I have used the transform function. Example:

Original dataframe:

```
name   year  grade
Jack   2010  6
Jack   2011  7
Rosie  2010  7
Rosie  2011  8
```

After groupby transform:

```
name   year  grade  average grade
Jack   2010  6      6.5
Jack   2011  7      6.5
Rosie  2010  7      7.5
Rosie  2011  8      7.5
```

However, with more advanced functions based on multiple columns, things get more complicated. What puzzles me is that I seem unable to access multiple columns in a groupby-transform combination.
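
That is expected: `transform` feeds the function one column at a time. A sketch of the usual workaround, using the example frame above plus a hypothetical year-weighted grade: compute the multi-column statistic per group with `apply`, then map it back onto the rows by group key.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Jack', 'Jack', 'Rosie', 'Rosie'],
    'year':  [2010, 2011, 2010, 2011],
    'grade': [6, 7, 7, 8],
})

# Single-column case: transform broadcasts the group mean to every row.
df['average grade'] = df.groupby('name')['grade'].transform('mean')

# Multi-column case: apply sees the whole group frame, so it can combine
# columns; mapping by the group key restores row-level detail.
weighted = df.groupby('name').apply(
    lambda g: (g['grade'] * g['year']).sum() / g['year'].sum())
df['weighted grade'] = df['name'].map(weighted)
print(df)
```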

Assign control vs. treatment groupings randomly based on % for more than 2 groups

Piggybacking off my own previous question, "python pandas: assign control vs. treatment groupings randomly based on %": thanks to @maxU, I know how to assign random control/treatment groupings for 2 groups, but what if I have 3 groups or more? For example:

```
df.head()
customer_id  Group  (many other columns)
ABC          1
CDE          3
BHF          2
NID          1
WKL          3
SDI          2
JSK          1
OSM          3
MPA          2
MAD          1
```

```python
pd.pivot_table(df, index=['Group'], values=['customer_id'],
               aggfunc=lambda x: len(x.unique()))
```

```
Group 1: 270
Group 2: 180
Group 3: 330
```
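
A sketch of how the two-group trick generalizes, with hypothetical customer IDs and an assumed 50/30/20 split: `np.random.choice` draws one label per row with the requested probabilities, and works for any number of groups.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; the 50/30/20 split is an assumption.
df = pd.DataFrame({'customer_id': [f'C{i:03d}' for i in range(780)]})

rng = np.random.default_rng(42)
df['Group'] = rng.choice([1, 2, 3], size=len(df), p=[0.5, 0.3, 0.2])

# Check the realized split per group.
print(df.groupby('Group')['customer_id'].nunique())
```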