pandas-groupby

Using collect_set after exploding in a groupBy object in PySpark

江枫思渺然 submitted on 2019-12-11 17:31:55
Question: I have a dataframe with a schema like this:

    root
     |-- docId: string (nullable = true)
     |-- field_a: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- field_b: array (nullable = true)
     |    |-- element: string (containsNull = true)

I want to perform a groupBy on field_a and use collect_set to keep all the distinct values (basically the inner values of the lists) from field_b in the aggregation. I don't want to add a new column by exploding field_b and then do collect_set in
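One way to skip the explode-then-collect detour entirely, as a minimal sketch: assuming Spark 2.4+ (where flatten and array_distinct exist) and hypothetical rows matching the schema above, collect the field_b arrays per group, flatten them into one array, and deduplicate.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows shaped like the schema in the question.
df = spark.createDataFrame(
    [("d1", ["x"], ["a", "b"]), ("d2", ["x"], ["b", "c"])],
    ["docId", "field_a", "field_b"],
)

# collect_list gathers the field_b arrays per group, flatten merges them
# into a single array, and array_distinct drops duplicates -- no explode.
result = df.groupBy("field_a").agg(
    F.array_distinct(F.flatten(F.collect_list("field_b"))).alias("field_b_set")
)
result.show(truncate=False)
```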

Aggregations over specific columns of a large dataframe, with named output

你。 submitted on 2019-12-11 17:14:10
Question: I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or a regex, and the aggregation should produce a named output. This produces a sample dataframe:

    import pandas as pd
    import itertools
    import numpy as np

    col = "A,B,C".split(',')
    col1 = "1,2,3,4,5,6,7,8,9".split(',')
    col2 = "E,F,G".split(',')
    all_dims = [col, col1, col2]
    all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
    rng = pd.date_range(end
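The snippet is cut off, but one hedged sketch of "groups from a regex, with named output" follows: derive a grouping key from the column names and aggregate across columns. The grouping rule (first dot-separated token) and the `_total` suffix are assumptions, not taken from the question; note that `groupby(axis=1)` is deprecated in pandas 2.x.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with dot-separated column keys like the question builds.
df = pd.DataFrame(np.random.randint(0, 10, size=(4, 4)),
                  columns=["A.1.E", "A.2.F", "B.1.E", "B.2.G"])

# Extract a group label from each column name with a regex capture,
# then sum each column group and name the outputs after the group keys.
groups = df.columns.str.extract(r"^([^.]+)", expand=False)
named = df.groupby(groups, axis=1).sum().add_suffix("_total")
print(named)
```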

Python PANDAS: GroupBy First Transform Create Indicator

牧云@^-^@ submitted on 2019-12-11 17:00:36
Question: I have a pandas dataframe in the following format:

    id,criteria_1,criteria_2,criteria_3,criteria_4,criteria_5,criteria_6
    1,0,0,95,179,1,1
    1,0,0,97,185,NaN,1
    1,1,2,92,120,1,1
    2,0,0,27,0,1,NaN
    2,1,2,90,179,1,1
    2,2,5,111,200,1,1
    3,1,2,91,175,1,1
    3,0,8,90,27,NaN,NaN
    3,0,0,22,0,NaN,NaN

I have the following working code:

    df_final = df[((df['criteria_1'] >= 1.0) | (df['criteria_2'] >= 2.0)) &
                  (df['criteria_3'] >= 90.0) & (df['criteria_4'] <= 180.0) &
                  ((df['criteria_5'].notnull()) & (df['criteria_6']
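The code above is cut off, but the title's ask (a per-group indicator via transform) can be sketched as follows; the exact indicator rule is an assumption pieced together from the visible criteria.

```python
from io import StringIO

import pandas as pd

# Rebuild the sample data shown in the question.
data = """id,criteria_1,criteria_2,criteria_3,criteria_4,criteria_5,criteria_6
1,0,0,95,179,1,1
1,0,0,97,185,NaN,1
1,1,2,92,120,1,1
2,0,0,27,0,1,NaN
2,1,2,90,179,1,1
2,2,5,111,200,1,1
3,1,2,91,175,1,1
3,0,8,90,27,NaN,NaN
3,0,0,22,0,NaN,NaN"""
df = pd.read_csv(StringIO(data))

# Row-level mask built from the question's visible criteria.
mask = (((df['criteria_1'] >= 1.0) | (df['criteria_2'] >= 2.0))
        & (df['criteria_3'] >= 90.0) & (df['criteria_4'] <= 180.0)
        & df['criteria_5'].notnull() & df['criteria_6'].notnull())

# Broadcast "does any row of this id satisfy the criteria" back onto
# every row of that id as a 0/1 indicator column.
df['indicator'] = mask.groupby(df['id']).transform('any').astype(int)
print(df)
```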

Stacked bar plots from list of dataframes with groupby command

穿精又带淫゛_ submitted on 2019-12-11 16:56:16
Question: I wish to create a (2x3) stacked bar chart subplot from the results of a groupby.size command; let me explain. I have a list of dataframes: list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]. A small example of these df's would be:

    Create Time          Location         Area Id  Beat  Priority  ...  Closed Time
    2011-01-01 00:00:00  ST&SAN PABLO AV  1.0      06X   1.0       ...  2011-01-01 00:28:17
    2011-01-01 00:01:11  ST&HANNAH ST     1.0      07X   1.0       ...  2011-01-01 01:12:56
    ...

(can only add a few columns as the layout
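A minimal sketch of the 2x3 layout: the dataframes below are hypothetical stand-ins for df_2011 through df_2016 (only the Beat and Priority columns from the sample are used, since the real grouping keys are not fully shown), and each subplot stacks groupby(...).size() counts after unstacking.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fake_year_df(n=100):
    # Stand-in for one year's dataframe with two of the sample's columns.
    return pd.DataFrame({
        "Beat": rng.choice(["06X", "07X", "08X"], size=n),
        "Priority": rng.choice([1.0, 2.0], size=n),
    })

list_df = [fake_year_df() for _ in range(6)]
years = range(2011, 2017)

fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharey=True)
for ax, df, year in zip(axes.ravel(), list_df, years):
    # groupby(...).size() gives long-form counts; unstack turns Priority
    # into columns, which DataFrame.plot renders as stacked segments.
    counts = df.groupby(["Beat", "Priority"]).size().unstack(fill_value=0)
    counts.plot(kind="bar", stacked=True, ax=ax, legend=False)
    ax.set_title(str(year))
plt.tight_layout()
plt.show()
```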

How to include two lambda operations in transform function?

自古美人都是妖i submitted on 2019-12-11 16:55:40
Question: I have a dataframe like the one given below:

    df = pd.DataFrame({
        'date': ['2173-04-03 12:35:00', '2173-04-03 17:00:00', '2173-04-03 20:00:00',
                 '2173-04-04 11:00:00', '2173-04-04 12:00:00', '2173-04-04 11:30:00',
                 '2173-04-04 16:00:00', '2173-04-04 22:00:00', '2173-04-05 04:00:00'],
        'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1],
        'val': [5, 5, 5, 10, 10, 5, 5, 8, 8]
    })

I would like to apply a couple of logics (logic_1 on the val column and logic_2 on the date column) to the code. Please find the logic below:

    logic_1 = lambda x: (x.shift(2)
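Both lambdas are truncated above, so the sketch below uses hypothetical stand-ins for logic_1 and logic_2; the point is the mechanics: transform accepts a single function, so each lambda is applied to its own column through its own transform call.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2173-04-03 12:35:00', '2173-04-03 17:00:00',
                            '2173-04-03 20:00:00', '2173-04-04 11:00:00',
                            '2173-04-04 12:00:00']),
    'subject_id': [1, 1, 1, 1, 1],
    'val': [5, 5, 5, 10, 10],
})

# Hypothetical stand-ins for the question's truncated lambdas.
logic_1 = lambda x: x.shift(2)                   # applied to val
logic_2 = lambda x: x.diff().dt.total_seconds()  # applied to date

g = df.groupby('subject_id')

# One transform call per column, each with its own lambda.
df['val_out'] = g['val'].transform(logic_1)
df['date_out'] = g['date'].transform(logic_2)
print(df)
```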

Pandas groupby on a column of lists

别说谁变了你拦得住时间么 submitted on 2019-12-11 16:50:00
Question: I have a pandas dataframe with a column that contains lists:

    df = pd.DataFrame({'List': [['once', 'upon'], ['once', 'upon'], ['a', 'time'],
                                ['there', 'was'], ['a', 'time']],
                       'Count': [2, 3, 4, 1, 2]})

    Count  List
    2      [once, upon]
    3      [once, upon]
    4      [a, time]
    1      [there, was]
    2      [a, time]

How can I combine the List column and sum the Count column? The expected result is:

    Count  List
    5      [once, upon]
    6      [a, time]
    1      [there, was]

I've tried:

    df.groupby('List')['Count'].sum()

which results in: TypeError:
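The TypeError arises because lists are unhashable and therefore cannot serve as groupby keys. A minimal sketch of the usual workaround: group on a tuple view of the column, then convert back to lists.

```python
import pandas as pd

df = pd.DataFrame({'List': [['once', 'upon'], ['once', 'upon'], ['a', 'time'],
                            ['there', 'was'], ['a', 'time']],
                   'Count': [2, 3, 4, 1, 2]})

# Tuples are hashable, so they can act as the grouping key.
key = df['List'].apply(tuple)
out = df.groupby(key)['Count'].sum().reset_index()

# Convert the tuple keys back to lists to match the expected output.
out['List'] = out['List'].apply(list)
print(out)
```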

Groupby on columns with overlapping groups

强颜欢笑 submitted on 2019-12-11 16:12:37
Question: Continuing from my previous question. This produces a dataframe with 81 columns, filled with random numbers:

    import pandas as pd
    import itertools
    import numpy as np

    col = "A,B,C".split(',')
    col1 = "1,2,3,4,5,6,7,8,9".split(',')
    col2 = "E,F,G".split(',')
    all_dims = [col, col1, col2]
    all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
    rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
    df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys
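A single groupby assigns each column to exactly one group, so genuinely overlapping groups need a different mechanism. One hedged sketch: define each group as a regex over the column names (the two group specs below are assumptions, not from the question) and concatenate the per-group aggregates, letting columns appear in more than one group.

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild the question's 81-column frame.
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_keys = ['.'.join(i) for i in itertools.product(col, col1, col2)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))),
                  index=rng, columns=all_keys)

# Hypothetical overlapping group specs: "A.1.E" matches both patterns.
groups = {"A_all": r"^A\.", "ends_E": r"\.E$"}

# One filter+aggregate per group, joined into a named output frame.
out = pd.concat({name: df.filter(regex=pat).sum(axis=1)
                 for name, pat in groups.items()}, axis=1)
print(out.head())
```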

Python Groupby with Boolean Mask

廉价感情. submitted on 2019-12-11 15:55:54
Question: I have a pandas dataframe with the following general format:

    id,atr1,atr2,orig_date,fix_date
    1,bolt,l,2000-01-01,nan
    1,screw,l,2000-01-01,nan
    1,stem,l,2000-01-01,nan
    2,stem,l,2000-01-01,nan
    2,screw,l,2000-01-01,nan
    2,stem,l,2001-01-01,2001-01-01
    3,bolt,r,2000-01-01,nan
    3,stem,r,2000-01-01,nan
    3,bolt,r,2001-01-01,2001-01-01
    3,stem,r,2001-01-01,2001-01-01

The desired result would be the following:

    id,atr1,atr2,orig_date,fix_date,failed_part_ind
    1,bolt,l,2000-01-01,nan,0
    1,screw,l,2000-01-01,nan,0
    1
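The question is cut off, so the exact rule behind failed_part_ind is an assumption in the sketch below: flag a row when some row in the same (id, atr1) group has a non-null fix_date, which matches the title's "groupby with boolean mask" framing and the visible zeros.

```python
from io import StringIO

import pandas as pd

data = """id,atr1,atr2,orig_date,fix_date
1,bolt,l,2000-01-01,
1,screw,l,2000-01-01,
1,stem,l,2000-01-01,
2,stem,l,2000-01-01,
2,screw,l,2000-01-01,
2,stem,l,2001-01-01,2001-01-01
3,bolt,r,2000-01-01,
3,stem,r,2000-01-01,
3,bolt,r,2001-01-01,2001-01-01
3,stem,r,2001-01-01,2001-01-01"""
df = pd.read_csv(StringIO(data), parse_dates=['orig_date', 'fix_date'])

# Boolean mask: which rows have a recorded fix_date?
mask = df['fix_date'].notna()

# Broadcast the mask over each (id, atr1) group: if any row of the part
# was fixed, every row of that part gets the 0/1 indicator set to 1.
df['failed_part_ind'] = (mask.groupby([df['id'], df['atr1']])
                             .transform('any').astype(int))
print(df)
```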

Error when calling a groupby object inside a Pandas DataFrame

戏子无情 submitted on 2019-12-11 15:30:05
Question: I've got this dataframe:

       person_code  #CNAE  growth  size
    0          231     32    0.54    32
    1          233     43    0.12   333
    2          432     32    0.44    21
    3          431     56    0.32    23
    4          654     89    0.12    89
    5          764     32    0.20   211
    6          434     32    0.82    90

I need to create a new column called "top3growth". For that I will need to check the df's #CNAE for each row and add an extra column pointing out the 3 persons with the highest growth for that CNAE (it will add a dataframe inside the df dataframe). To create the "top3dfs" I'm using this groupby:

    a = sql2.groupby('#CNAE'
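A minimal sketch of one way to get there without nesting whole dataframes inside cells: rank growth within each #CNAE and map the top-3 person_codes back onto every row. Storing lists rather than sub-dataframes is a design choice of this sketch, not a requirement stated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'person_code': [231, 233, 432, 431, 654, 764, 434],
    '#CNAE':       [32, 43, 32, 56, 89, 32, 32],
    'growth':      [0.54, 0.12, 0.44, 0.32, 0.12, 0.20, 0.82],
    'size':        [32, 333, 21, 23, 89, 211, 90],
})

# Sort by growth once, then take the first three person_codes per #CNAE.
top3 = (df.sort_values('growth', ascending=False)
          .groupby('#CNAE')['person_code']
          .apply(lambda s: s.head(3).tolist()))

# Attach the per-CNAE top-3 list to every row with that #CNAE.
df['top3growth'] = df['#CNAE'].map(top3)
print(df)
```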

Pandas scale multiple columns at once and inverse transform with groupby()

ε祈祈猫儿з submitted on 2019-12-11 15:09:56
Question: I have a dataframe like the one below. I want to apply two MinMaxScalers on X_data and y_data across multiple columns, and then the inverse transform should give me back the actual values. Please suggest and help me on this. Thanks in advance.

X_data:

       Customer      0      1      2      3
    0         A  855.0  989.0  454.0  574.0
    1         A  989.0  454.0  574.0  395.0
    2         A  454.0  574.0  395.0  162.0
    3         A  574.0  395.0  162.0  123.0
    4         A  395.0  162.0  123.0  342.0
    5         B  875.0  999.0

y_data:

       Customer      0      1
    0         A  395.0  162.0
    1         A  162.0  123.0
    2         A  123.0  342.0
    3         A  342.0  232.0
    4         A  232.0  657.0
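A minimal sketch of the per-group scale-and-invert round trip, assuming scikit-learn's MinMaxScaler and a hypothetical small X_data (y_data would be handled the same way with its own scalers): fit one scaler per Customer, keep it, and reuse it for the inverse transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical X_data in the question's shape.
x_data = pd.DataFrame({
    'Customer': ['A', 'A', 'A', 'B', 'B'],
    0: [855.0, 989.0, 454.0, 875.0, 999.0],
    1: [989.0, 454.0, 574.0, 999.0, 888.0],
})
value_cols = [0, 1]

scalers = {}       # one fitted scaler per customer, kept for the inverse
scaled_parts = []
for customer, grp in x_data.groupby('Customer'):
    scaler = MinMaxScaler()
    part = grp.copy()
    part[value_cols] = scaler.fit_transform(grp[value_cols])
    scalers[customer] = scaler
    scaled_parts.append(part)
scaled = pd.concat(scaled_parts).sort_index()

# Inverting with each group's own scaler restores the original values.
restored = scaled.copy()
for customer, grp in scaled.groupby('Customer'):
    restored.loc[grp.index, value_cols] = (
        scalers[customer].inverse_transform(grp[value_cols]))
print(np.allclose(restored[value_cols], x_data[value_cols]))  # True
```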