pandas-groupby

Understanding the execution of DataFrame in python

独自空忆成欢 submitted on 2020-03-03 07:07:05
Question: I am new to Python and I want to understand how execution takes place in a DataFrame. Let's try this with an example from the dataset found on kaggle.com (Titanic: Machine Learning from Disaster). I wanted to replace the NaN values with the mean() for the respective sex, i.e. the NaN values for men should be replaced by the mean of the men's ages, and likewise for women. I achieved this with the following line of code: _data['new_age']=_data['new_age'].fillna(_data.groupby('Sex')['Age'].transform(

Get unique values of multiple columns as a new dataframe in pandas

不羁岁月 submitted on 2020-02-13 07:50:47
Question: Given a pandas DataFrame df with at least the columns C1, C2, C3, how would you get all the unique C1, C2, C3 values as a new DataFrame? In other words, something similar to: SELECT C1,C2,C3 FROM T GROUP BY C1,C2,C3 I tried print df.groupby(by=['C1','C2','C3']) but I'm getting <pandas.core.groupby.DataFrameGroupBy object at 0x000000000769A9E8>
Answer 1: I believe you need drop_duplicates if you want all unique triples: df = df.drop_duplicates(subset=['C1','C2','C3']) If you want to use groupby, add first: df = df.groupby(by
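A short sketch of the two routes mentioned above, using hypothetical sample data. The groupby variant shown here uses size() rather than the truncated call from the answer, so treat it as one possible illustration and not as the answerer's code.

```python
import pandas as pd

# Hypothetical sample data with one extra column besides C1, C2, C3.
df = pd.DataFrame({
    "C1": [1, 1, 2, 2],
    "C2": ["a", "a", "b", "b"],
    "C3": ["x", "x", "y", "z"],
    "other": [10, 20, 30, 40],
})

cols = ["C1", "C2", "C3"]

# Rough equivalent of: SELECT C1, C2, C3 FROM T GROUP BY C1, C2, C3
unique_rows = df[cols].drop_duplicates().reset_index(drop=True)

# groupby alone only builds a lazy GroupBy object; an aggregation such as
# size() is what turns it into a result with one row per combination.
via_groupby = df.groupby(cols).size().reset_index()[cols]

print(unique_rows)
```

This also explains the output in the question: printing a GroupBy object shows its repr because nothing has been aggregated yet.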

How to group and sum some results into "Others" (style format in euros)

折月煮酒 submitted on 2020-02-07 05:28:07
Question: I want to make a pie chart for Europe and some specific countries. I need to group and sum some countries or companies into a group called "Others", for example all the companies that have a budget of less than 10000 euros. import pandas as pd from pandas import Series, DataFrame import numpy as np import matplotlib.pyplot as plt Year Project Entity Participation Country Budget 0 2015 671650 - MMMAGIC - 5G FUNDACION IMDEA NETWORK* Participant Spain € 304,000 1 2015 671650 - MMMAGIC - 5G ROHDE &
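One way the "Others" bucket could be built is sketched below. The entity names and amounts are hypothetical, the Budget column is assumed to hold strings like "€ 304,000" as in the excerpt, and the 10000 euro threshold is taken from the question.

```python
import pandas as pd

# Hypothetical data in the same "€ 304,000" string format as the excerpt.
df = pd.DataFrame({
    "Entity": ["A", "B", "C", "D"],
    "Budget": ["€ 304,000", "€ 8,000", "€ 1,500", "€ 50,000"],
})

# Strip the currency symbol and thousands separators, then convert to float.
df["Budget_eur"] = (
    df["Budget"]
    .str.replace("€", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(float)
)

# Keep the entity name when the budget reaches the threshold, else use "Others".
threshold = 10_000
label = df["Entity"].where(df["Budget_eur"] >= threshold, "Others")

# Sum budgets per label; everything below the threshold collapses into one row.
totals = df.groupby(label)["Budget_eur"].sum()

# Format the totals back into euro strings for display or chart labels.
formatted = totals.map("€ {:,.0f}".format)

print(formatted)
```

The numeric `totals` Series is what a pie chart would be drawn from (for example with `totals.plot.pie()` when matplotlib is available), while `formatted` is only meant for labels.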

Groupby count based on year and specific condition

谁都会走 submitted on 2020-02-07 02:35:50
Question: I have a dataframe as shown below:
Tenancy_ID  Unit_ID  Tenancy_End_Date
1           A        2012-09-06 11:34:15
2           B        2013-09-08 10:35:18
3           A        2014-09-06 11:34:15
4           C        2014-09-06 11:34:15
5           B        2015-09-06 11:34:15
6           A        2014-09-06 11:34:15
5           A        2015-09-06 11:34:15
7           A        2019-09-06 11:34:15
4           C        2014-01-06 11:34:15
5           C        2014-05-06 11:34:15
From the above I would like to generate the dataframe below. Expected output:
Unit_ID  NoC_2012  NoC_2013  NoC_2014  NoC_2015  NoC_2016  NoC_2017  NoC_2018  NoC_2019
A        1         0         2         1         0         0         0         1
B        0         1         0         1         0         0         0
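A possible sketch for the per-year counts, using the rows shown in the excerpt. The "specific condition" from the title is cut off in the excerpt, so this only covers the plain count of tenancy end dates per unit and year; crosstab is one choice among several.

```python
import pandas as pd

# The ten rows visible in the excerpt above.
df = pd.DataFrame({
    "Tenancy_ID": [1, 2, 3, 4, 5, 6, 5, 7, 4, 5],
    "Unit_ID": ["A", "B", "A", "C", "B", "A", "A", "A", "C", "C"],
    "Tenancy_End_Date": pd.to_datetime([
        "2012-09-06 11:34:15", "2013-09-08 10:35:18", "2014-09-06 11:34:15",
        "2014-09-06 11:34:15", "2015-09-06 11:34:15", "2014-09-06 11:34:15",
        "2015-09-06 11:34:15", "2019-09-06 11:34:15", "2014-01-06 11:34:15",
        "2014-05-06 11:34:15",
    ]),
})

# Count rows per (unit, year of the end date).
years = df["Tenancy_End_Date"].dt.year
counts = pd.crosstab(df["Unit_ID"], years)

# Add the years with no tenancies (2016-2018) as zero columns, then rename
# the columns to the NoC_<year> pattern from the expected output.
counts = counts.reindex(columns=list(range(2012, 2020)), fill_value=0)
counts = counts.add_prefix("NoC_")

print(counts)
```

The reindex step matters because crosstab only creates columns for years that actually occur in the data.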

How to generate unique id and sub_id for each group

戏子无情 submitted on 2020-02-06 02:41:52
Question: My goal is to generate an id (trajectory id) and a sub id (sub-trajectory id) for each group (u_uuid and p_uuid). I tried the ngroup function and it didn't work. data = [ {'u_uuid': 110, 'p_uuid': 'aaa', 'mode': 'walk', 'dest': 'work'}, {'u_uuid': 110, 'p_uuid': 'aaa', 'mode': 'walk', 'dest': 'work'}, {'u_uuid': 110, 'p_uuid': 'aaa', 'mode': 'bus', 'dest': 'work'}, {'u_uuid': 110, 'p_uuid': 'aaa', 'mode': 'bus', 'dest': 'work'}, {'u_uuid': 110, 'p_uuid': 'aaa', 'mode': 'walk', 'dest': 'work'},
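The expected output is not visible in the excerpt, so the sketch below rests on an assumption: the sub id is taken to mean a counter that increases each time mode changes within a (u_uuid, p_uuid) group. The last two rows (u_uuid 220) are hypothetical and were added only to show a second group.

```python
import pandas as pd

data = [
    {"u_uuid": 110, "p_uuid": "aaa", "mode": "walk", "dest": "work"},
    {"u_uuid": 110, "p_uuid": "aaa", "mode": "walk", "dest": "work"},
    {"u_uuid": 110, "p_uuid": "aaa", "mode": "bus", "dest": "work"},
    {"u_uuid": 110, "p_uuid": "aaa", "mode": "bus", "dest": "work"},
    {"u_uuid": 110, "p_uuid": "aaa", "mode": "walk", "dest": "work"},
    # Hypothetical rows, not from the original question.
    {"u_uuid": 220, "p_uuid": "bbb", "mode": "car", "dest": "home"},
    {"u_uuid": 220, "p_uuid": "bbb", "mode": "car", "dest": "home"},
]
df = pd.DataFrame(data)
keys = ["u_uuid", "p_uuid"]

# ngroup numbers the groups from 0; sort=False keeps order of first appearance.
df["id"] = df.groupby(keys, sort=False).ngroup() + 1

# A row starts a new run when its mode differs from the previous row's mode
# within the same group (the first row of a group compares against NaN).
run_start = (df["mode"] != df.groupby(keys)["mode"].shift()).astype(int)

# The cumulative sum of run starts within each group gives the sub id.
df["sub_id"] = run_start.groupby([df["u_uuid"], df["p_uuid"]]).cumsum()

print(df)
```

If the sub id is meant to be something else (for example one value per distinct mode, regardless of order), only the `run_start` line would need to change.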

Pandas simple correlation of two grouped DataFrame columns

孤街浪徒 submitted on 2020-02-05 06:34:30
Question: Is there a good way to get the simple correlation of two grouped DataFrame columns? It seems that, no matter what, the pandas .corr() functions want to return a correlation matrix. E.g., i = pd.MultiIndex.from_product([['A','B','C'], np.arange(1, 11, 1)], names=['Name','Num']) test = pd.DataFrame(np.random.randn(30, 2), i, columns=['X', 'Y']) test.groupby(['Name'])['X','Y'].corr() returns X Y Name A X 1.000000 0.152663 Y 0.152663 1.000000 B X 1.000000 -0.155113 Y -0.155113 1.000000 C X 1.000000
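A minimal sketch of one way to get a single correlation value per group, reusing the setup from the question. The seeded generator is an assumption added for reproducibility, and Series.corr inside apply is just one option, not necessarily the accepted answer.

```python
import numpy as np
import pandas as pd

# Same shape as the question's example; the seed is an addition for repeatability.
rng = np.random.default_rng(0)
i = pd.MultiIndex.from_product(
    [["A", "B", "C"], np.arange(1, 11)], names=["Name", "Num"]
)
test = pd.DataFrame(rng.standard_normal((30, 2)), index=i, columns=["X", "Y"])

# Series.corr returns a scalar, so apply yields one value per group
# instead of a 2x2 correlation matrix per group.
corr_xy = test.groupby("Name").apply(lambda g: g["X"].corr(g["Y"]))

print(corr_xy)
```

Note that selecting several columns after groupby needs a list, as in `[['X', 'Y']]`; the tuple-style `['X','Y']` from the question is rejected by recent pandas versions.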

Pandas grouping and resampling for a bar plot:

情到浓时终转凉″ submitted on 2020-02-05 02:47:54
Question: I have a dataframe that records concentrations for several different locations in different years, with a high temporal frequency (<1 hour). I am trying to make a bar/multibar plot showing mean concentrations at different locations in different years. To calculate the mean concentration, I have to apply quality control filters to daily and monthly data. My approach is to first apply filters and resample per year, and then do the grouping by location and year. Also, out of all the locations (in the
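The filter-then-group workflow described above might look roughly like the sketch below. Everything in it is hypothetical: the column names location and conc, the 30-minute synthetic data, and the quality-control rule (dropping a -999 sentinel), since the real filters are not visible in the excerpt. It groups on the index year directly instead of calling resample.

```python
import numpy as np
import pandas as pd

# Hypothetical 30-minute data for two locations over two years.
idx = pd.date_range("2018-01-01", "2019-12-31 23:30", freq="30min")
frames = []
for loc, base in [("site_a", 1.0), ("site_b", 10.0)]:
    conc = np.where(idx.year == 2018, base, 2 * base)
    frames.append(pd.DataFrame({"location": loc, "conc": conc}, index=idx))
df = pd.concat(frames)

# Inject a sentinel for "bad" readings so the filter has something to remove.
df.iloc[::1000, df.columns.get_loc("conc")] = -999.0

# Hypothetical quality-control filter: keep only non-negative readings.
valid = df[df["conc"] >= 0]

# Mean per location and calendar year, with one column per location.
yearly = (
    valid.groupby(["location", valid.index.year])["conc"]
    .mean()
    .unstack("location")
)

print(yearly)
```

With years as rows and locations as columns, `yearly.plot.bar()` would draw one group of bars per year when matplotlib is installed.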

Pandas grouping and resampling for a bar plot:

和自甴很熟 submitted on 2020-02-05 02:46:26
Question: I have a dataframe that records concentrations for several different locations in different years, with a high temporal frequency (<1 hour). I am trying to make a bar/multibar plot showing mean concentrations at different locations in different years. To calculate the mean concentration, I have to apply quality control filters to daily and monthly data. My approach is to first apply filters and resample per year, and then do the grouping by location and year. Also, out of all the locations (in the

python: pandas: how to find max value in a column based on groupby another column

孤街醉人 submitted on 2020-02-04 13:19:31
Question: I want to group my dataframe by one column, SERVER, and then find the max value in another column, JOB_ID. DF:
   SERVER   JOB_ID  LOG_FILE                   TIME
0  abc_123  1       1/abc_123/dep2/1/123.log   2019-12-05T05:06:16.346Z
1  abc_123  10      1/abc_123/dep2/10/123.log  2019-12-04T17:05:28.335Z
2  abc_123  11      1/abc_123/dep2/11/123.log  2019-12-04T20:27:03.988Z
3  abc_123  12      1/abc_123/dep2/12/123.log  2019-12-04T20:35:49.039Z
4  abc_123  13      1/abc_123/dep2/13/123.log  2019-12-04T20:42:36.890Z
5  abc_123  14      1/abc_123/dep2/14/123.log  2019
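A minimal sketch of the grouped maximum, using the first rows of the excerpt plus a hypothetical second server (xyz_456) so that more than one group appears. The row order in the excerpt (1, 10, 11, ...) suggests JOB_ID may be stored as text, so converting it to numbers first is an assumption made here.

```python
import pandas as pd

df = pd.DataFrame({
    # The xyz_456 rows are hypothetical and not part of the original question.
    "SERVER": ["abc_123", "abc_123", "abc_123", "xyz_456", "xyz_456"],
    "JOB_ID": ["1", "10", "11", "2", "9"],
    "LOG_FILE": [
        "1/abc_123/dep2/1/123.log",
        "1/abc_123/dep2/10/123.log",
        "1/abc_123/dep2/11/123.log",
        "1/xyz_456/dep2/2/123.log",
        "1/xyz_456/dep2/9/123.log",
    ],
})

# Compare job ids as numbers; as strings, "9" would sort above "11".
df["JOB_ID"] = pd.to_numeric(df["JOB_ID"])

# Highest JOB_ID per server.
max_per_server = df.groupby("SERVER")["JOB_ID"].max()

# Full rows that hold the maximum, so LOG_FILE comes along as well.
top_rows = df.loc[df.groupby("SERVER")["JOB_ID"].idxmax()]

print(max_per_server)
print(top_rows)
```

`max()` is enough when only the number is needed; `idxmax()` is the variant to reach for when the rest of the row should be kept.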

python: pandas: how to find max value in a column based on groupby another column

风格不统一 submitted on 2020-02-04 13:19:10
Question: I want to group my dataframe by one column, SERVER, and then find the max value in another column, JOB_ID. DF:
   SERVER   JOB_ID  LOG_FILE                   TIME
0  abc_123  1       1/abc_123/dep2/1/123.log   2019-12-05T05:06:16.346Z
1  abc_123  10      1/abc_123/dep2/10/123.log  2019-12-04T17:05:28.335Z
2  abc_123  11      1/abc_123/dep2/11/123.log  2019-12-04T20:27:03.988Z
3  abc_123  12      1/abc_123/dep2/12/123.log  2019-12-04T20:35:49.039Z
4  abc_123  13      1/abc_123/dep2/13/123.log  2019-12-04T20:42:36.890Z
5  abc_123  14      1/abc_123/dep2/14/123.log  2019