pandas-groupby

groupby counter of rows

让人想犯罪 __ submitted on 2019-12-20 06:51:03
Question: I am trying to create a new variable that counts how many times the same id has been seen over time. I need to go from this dataframe

```
id  clae6     year  quarter
1   475230.0  2007  1
1   475230.0  2007  2
1   475230.0  2007  3
1   475230.0  2007  4
1   475230.0  2008  1
1   475230.0  2008  2
2   475230.0  2007  1
2   475230.0  2007  2
2   475230.0  2007  3
2   475230.0  2007  4
2   475230.0  2008  1
3   475230.0  2010  1
3   475230.0  2010  2
3   475230.0  2010  3
3   475230.0  2010  4
```

to this:

```
id  clae6     year  quarter  new_variable
1   475230.0  2007  1        1
…
```
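A minimal sketch of the usual approach: a per-id running counter via groupby().cumcount(), which numbers each group's rows starting at 0. The column names follow the excerpt; the sort order is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 2, 2],
    'clae6':   [475230.0] * 6,
    'year':    [2007, 2007, 2007, 2007, 2007, 2007],
    'quarter': [1, 2, 3, 4, 1, 2],
})

# sort so the counter follows chronological order within each id,
# then number each id's rows 1, 2, 3, ...
df = df.sort_values(['id', 'year', 'quarter'])
df['new_variable'] = df.groupby('id').cumcount() + 1
```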

Export dask groups to csv

风流意气都作罢 submitted on 2019-12-20 06:39:37
Question: I have a single, large file. It has 40,955,924 lines and is >13 GB. I need to be able to separate this file into individual files based on a single field. If I were using a pd.DataFrame I would use this:

```python
for k, v in df.groupby(['id']):
    v.to_csv(k, sep='\t', header=True, index=False)
```

However, I get the error KeyError: 'Column not found: 0'. There is a solution to this specific error in "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe,
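A sketch of one dask-friendly workaround, assuming the goal is one file per id without holding the whole frame in memory: compute only the distinct keys, then filter and write per key. The file name and separator here are hypothetical. Note this rescans the input once per id, trading time for memory.

```python
import dask.dataframe as dd

# hypothetical input path and separator; adjust to the real file
ddf = dd.read_csv('big_file.tsv', sep='\t', blocksize='256MB')

# materialize only the distinct keys, then write one file per id;
# each filter is evaluated lazily, so memory stays bounded by one group
for key in ddf['id'].unique().compute():
    ddf[ddf['id'] == key].compute().to_csv(
        f'{key}.tsv', sep='\t', header=True, index=False
    )
```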

Assign Unique Numeric Group IDs to Groups in Pandas [duplicate]

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-20 03:52:07
Question: This question already has answers here: "Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df" (3 answers). Closed last year. I've consistently run into the issue of having to assign a unique ID to each group in a data set. I've used this when zero-padding for RNNs, generating graphs, and on many other occasions. This can usually be done by concatenating the values in each pd.groupby column. However, it is often the case that the number
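The idiom usually suggested for this is GroupBy.ngroup(), which labels each distinct group with a consecutive integer. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ann', 'Bob', 'Cid', 'Bob'],
                   'value': [1, 2, 3, 4, 5]})

# ngroup() assigns each distinct group a consecutive integer ID,
# avoiding fragile string concatenation of the key columns
df['group_id'] = df.groupby('name').ngroup()
```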

Conditionally filling blank values in Pandas dataframes

夙愿已清 submitted on 2019-12-20 03:03:01
Question: I have a dataframe that looks as follows (more columns have been dropped):

```
memberID  shipping_country
264991
264991    Canada
100       USA
5000
5000      UK
```

I'm trying to fill the blank cells with the existing value of shipping_country for each user:

```
memberID  shipping_country
264991    Canada
264991    Canada
100       USA
5000      UK
5000      UK
```

However, I'm not sure what the most efficient way to do this is on a large-scale dataset. Perhaps a vectorized groupby method? Answer 1: You can use chained groupby
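A minimal sketch, assuming the blanks are NaN (if they are empty strings, replace them with NaN first). Using transform keeps the fill inside each member's rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'memberID': [264991, 264991, 100, 5000, 5000],
    'shipping_country': [np.nan, 'Canada', 'USA', np.nan, 'UK'],
})

# forward- and back-fill within each member only, so values never
# leak from one memberID into another
df['shipping_country'] = (
    df.groupby('memberID')['shipping_country']
      .transform(lambda s: s.ffill().bfill())
)
```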

Pandas - expanding mean with groupby

怎甘沉沦 submitted on 2019-12-20 02:54:37
Question: I'm trying to get an expanding mean. I can get it to work when I iterate and "group" just by filtering on the specific values, but that takes far too long. I feel like this should be easy to do with a groupby, but when I try, it applies the expanding mean to the entire dataset rather than to each of the groups in the groupby. For a quick example: I want to take this (in this particular case, grouped by 'player' and 'year') and get an expanding mean. player
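A minimal sketch of the per-group version: transform runs the expanding mean within each group separately. The 'points' column is a hypothetical stand-in for whatever is being averaged:

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B'],
    'year':   [2019, 2019, 2019, 2019, 2019],
    'points': [10, 20, 30, 5, 15],
})

# expanding().mean() is applied to each (player, year) group on its
# own, instead of to the whole frame
df['expanding_avg'] = (
    df.groupby(['player', 'year'])['points']
      .transform(lambda s: s.expanding().mean())
)
```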

Groupby class and count missing values in features

£可爱£侵袭症+ submitted on 2019-12-19 12:27:28
Question: I have a problem and I cannot find any solution on the web or in the documentation, even though I think it is very trivial. What do I want to do? I have a dataframe like this:

```
CLASS  FEATURE1  FEATURE2  FEATURE3
X      A         NaN       NaN
X      NaN       A         NaN
B      A         A         A
```

I want to group by the label (CLASS) and display the number of NaN values counted in every feature, so that it looks like this. The purpose of this is to get a general idea of how missing values are distributed over the different classes.

```
CLASS  FEATURE1  FEATURE2  FEATURE3
X      1         1         2
B      0         0         0
```
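A minimal sketch: drop the key column, turn the rest into a boolean NaN mask, and sum the mask per class:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CLASS':    ['X', 'X', 'B'],
    'FEATURE1': ['A', np.nan, 'A'],
    'FEATURE2': [np.nan, 'A', 'A'],
    'FEATURE3': [np.nan, np.nan, 'A'],
})

# isna() gives booleans; summing them per CLASS counts the NaNs
# in each feature column for each label
nan_counts = df.drop(columns='CLASS').isna().groupby(df['CLASS']).sum()
```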

Count occurrences for each year in pandas dataframe based on subgroup

纵饮孤独 submitted on 2019-12-19 09:55:49
Question: Imagine a pandas dataframe given by

```python
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'location': [1, 2, 3, 1, 2],
    'date': [pd.to_datetime('01-01-{}'.format(year))
             for year in [2015, 2016, 2015, 2017, 2018]]
}).set_index('id')
```

which looks like this:

```
    location  date
id
1   1         2015-01-01
1   2         2016-01-01
1   3         2015-01-01
2   1         2017-01-01
2   2         2018-01-01
```

Now I want to create a column for each year represented in the date column that counts occurrences by id. Hence the resulting dataframe should be like
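A minimal sketch, assuming the goal is one count column per year broadcast onto every row. crosstab builds the id-by-year counts, and a join on the id index spreads them back:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'location': [1, 2, 3, 1, 2],
    'date': [pd.to_datetime('01-01-{}'.format(year))
             for year in [2015, 2016, 2015, 2017, 2018]]
}).set_index('id')

# count rows per (id, year): one column per year, zeros where absent
year_counts = pd.crosstab(df.index, df['date'].dt.year)

# broadcast those counts back onto the original rows via the id index
result = df.join(year_counts)
```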

Sliding window iterator using rolling in pandas

前提是你 submitted on 2019-12-19 09:28:04
Question: If it's a single row at a time, I can get the iterator as follows:

```python
import pandas as pd
import numpy as np

a = np.zeros((100, 40))
X = pd.DataFrame(a)
for index, row in X.iterrows():
    print(index)
    print(row)
```

Now I want each iteration to return a subset: X[0:9, :], X[5:14, :], X[10:19, :], etc. How do I achieve this with rolling (pandas.DataFrame.rolling)? Answer 1: I'll experiment with the following dataframe. Setup:

```python
import pandas as pd
import numpy as np
from string import ascii_uppercase

def generic_portfolio_df
```
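rolling() is built to reduce each window to a scalar rather than hand back window DataFrames, so a plain generator is one workaround. A minimal sketch using the window size (10) and step (5) implied by the question:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((100, 40)))

def sliding_windows(df, size=10, step=5):
    # yield successive blocks of `size` rows, advancing by `step`
    for start in range(0, len(df) - size + 1, step):
        yield df.iloc[start:start + size]

for window in sliding_windows(X):
    print(window.index[0], window.index[-1])
```

Newer pandas also lets you iterate a Rolling object directly (and rolling accepts a step argument in recent versions), though the first windows it yields are partial.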

Pandas groupby and sum total of group

和自甴很熟 submitted on 2019-12-19 08:54:12
Question: I have a Pandas DataFrame with customer refund reasons. It contains these example data rows:

```
    case_type   claim_type
1   service     service
2   service     service
3   chargeback  service
4   chargeback  local_charges
5   service     supplier_service
6   chargeback  service
7   chargeback  service
8   chargeback  service
9   chargeback  service
10  chargeback  service
11  service     service_not_used
12  service     service_not_used
```

I would like to compare the customer's reason with some sort of labeled reason. This is no problem,
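Judging from the title, one plausible reading is wanting each pair's count alongside the total for its case_type group. A minimal sketch of that interpretation (not necessarily the thread's accepted answer):

```python
import pandas as pd

df = pd.DataFrame({
    'case_type':  ['service', 'service', 'chargeback', 'chargeback'],
    'claim_type': ['service', 'service', 'service', 'local_charges'],
})

# rows per (case_type, claim_type) pair
counts = (df.groupby(['case_type', 'claim_type'])
            .size().reset_index(name='count'))

# total rows per case_type, broadcast onto each pair with transform
counts['case_total'] = counts.groupby('case_type')['count'].transform('sum')
counts['share'] = counts['count'] / counts['case_total']
```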

Pandas groupby + transform and multiple columns

若如初见. submitted on 2019-12-19 08:28:24
Question: To obtain results computed on grouped data at the same level of detail as the original DataFrame (same observation count), I have used the transform function. Example: Original dataframe:

```
name,  year, grade
Jack,  2010, 6
Jack,  2011, 7
Rosie, 2010, 7
Rosie, 2011, 8
```

After groupby transform:

```
name,  year, grade, average grade
Jack,  2010, 6,     6.5
Jack,  2011, 7,     6.5
Rosie, 2010, 7,     7.5
Rosie, 2011, 8,     7.5
```

However, with more advanced functions based on multiple columns, things get more complicated. What
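The single-column case is one transform call; for functions of several columns, one workaround is a group-level apply whose result is mapped back onto the rows. The 'weight' column below is hypothetical, just to give the function two inputs:

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['Jack', 'Jack', 'Rosie', 'Rosie'],
    'year':   [2010, 2011, 2010, 2011],
    'grade':  [6, 7, 7, 8],
    'weight': [1, 3, 2, 2],   # hypothetical second column
})

# single column: transform broadcasts the group mean to every row
df['average_grade'] = df.groupby('name')['grade'].transform('mean')

# multiple columns: compute one value per group, then map it back
weighted = df.groupby('name').apply(
    lambda g: (g['grade'] * g['weight']).sum() / g['weight'].sum()
)
df['weighted_grade'] = df['name'].map(weighted)
```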