pandas-groupby

Python: re-sampling time series data which cannot be indexed

走远了吗 · Submitted on 2019-12-13 02:56:53
Question: The purpose of this question is to know how many trades "happened" in each second (count) as well as the total volume traded (sum). I have time series data which cannot be used as a unique index (there are multiple entries with the same timestamp — many trades can occur in the same millisecond), and therefore resample as explained here cannot work. Another approach was to first group by time as shown here (and later to resample per second). The problem is that grouping will cause only
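A sketch of one approach, using hypothetical trade data: `groupby` with `pd.Grouper` bins rows into 1-second buckets even when timestamps repeat, so a unique index is not required.

```python
import pandas as pd

# Hypothetical trades; note the duplicated timestamp in the first bucket.
trades = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-01-01 09:30:00.001", "2019-01-01 09:30:00.001",
        "2019-01-01 09:30:00.500", "2019-01-01 09:30:01.250",
    ]),
    "volume": [100, 50, 200, 75],
})

# Grouper assigns each row to its 1-second bin regardless of duplicates.
per_second = trades.groupby(pd.Grouper(key="time", freq="1s"))["volume"].agg(
    trades="count", total_volume="sum"
)
print(per_second)
```

The same binning also works through `df.resample("1s", on="time")`, since resampling only needs sortable timestamps, not unique ones.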

Left join in pandas without the creation of left and right variables

只愿长相守 · Submitted on 2019-12-13 00:48:05
Question: I'm missing something in the syntax of merging in pandas. I have the following two data frames:

>>> dfA
  s_name  geo    zip  date value
0  A002X  zip  60601  2010  None
1  A002Y  zip  60601  2010  None
2  A003X  zip  60601  2010  None
3  A003Y  zip  60601  2010  None

(or potentially some values exist which will not overlap with dfB:

>>> dfA_alternate
  s_name  geo    zip  date  value
0  A002X  zip  60601  2010   NaN
1  A002Y  zip  60601  2010   2.0
2  A003X  zip  60601  2010   NaN
3  A003Y  zip  60601  2010   NaN
)

And

>>> dfB
  s_name  geo  zip  date
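Since dfB is truncated in the question, the sketch below assumes it carries a `value` column for a subset of the keys. A left join on the shared key columns then keeps every dfA row, with no intermediate "left"/"right" variables needed:

```python
import pandas as pd

dfA = pd.DataFrame({
    "s_name": ["A002X", "A002Y", "A003X", "A003Y"],
    "geo": "zip", "zip": 60601, "date": 2010,
    "value": [None, None, None, None],
})
# Assumed shape for dfB (the original is cut off): values for two keys only.
dfB = pd.DataFrame({
    "s_name": ["A002X", "A003Y"],
    "geo": "zip", "zip": 60601, "date": 2010,
    "value": [1.5, 3.0],
})

# Drop dfA's placeholder column, then left-join dfB's values onto the keys.
merged = dfA.drop(columns="value").merge(
    dfB, on=["s_name", "geo", "zip", "date"], how="left"
)
print(merged)
```

Rows with no match in dfB come back with `NaN` in `value`, which matches the `dfA_alternate` pattern in the question.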

Adding new column to pandas DataFrame results in NaN

感情迁移 · Submitted on 2019-12-12 12:17:10
Question: I have a pandas DataFrame data with the following transaction data:

         A       date
0  M000833 2016-08-01
1  M000833 2016-08-01
2  M000833 2016-08-02
3  M000833 2016-08-02
4  M000511 2016-08-05

I want a new column with the count of the number of visits per consumer (multiple visits on the same day should be treated as 1). So I tried this:

import pandas as pd
data['noofvisits'] = data.groupby(['A'])['date'].nunique()

When I just run the statement without assigning it to the DataFrame, I get a pandas series with the
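The NaNs appear because `groupby(...).nunique()` returns one row per group, indexed by `A`, which does not align with the DataFrame's integer row index. `transform` broadcasts the per-group result back onto every original row:

```python
import pandas as pd

data = pd.DataFrame({
    "A": ["M000833", "M000833", "M000833", "M000833", "M000511"],
    "date": ["2016-08-01", "2016-08-01", "2016-08-02",
             "2016-08-02", "2016-08-05"],
})

# transform('nunique') computes distinct dates per consumer and aligns
# the result with data's original index, so assignment works.
data["noofvisits"] = data.groupby("A")["date"].transform("nunique")
print(data)
```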

How to use pd.concat with an uninitialized DataFrame?

强颜欢笑 · Submitted on 2019-12-12 09:46:31
Question: I want to be able to concat DataFrame results as they go through a function and end up with a whole new DataFrame holding just the results. How do I do this without having a DataFrame already created before the function? For example:

import pandas as pd
import numpy as np

rand_df = pd.DataFrame({'A': ['x','x','y','y','z','z','z'], 'B': np.random.randn(7)})

def myFuncOnDF(df, row):
    df = df.groupby(['A']).get_group(row).describe()

myFuncOnDF(rand_df, 'x')
myFuncOnDF(rand_df, 'y')
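One idiomatic pattern (a sketch, with the function renamed for clarity): return the partial result instead of rebinding a local variable, accumulate the pieces in a list, and call `pd.concat` once at the end — no pre-created accumulator DataFrame is needed.

```python
import pandas as pd
import numpy as np

rand_df = pd.DataFrame({"A": ["x", "x", "y", "y", "z", "z", "z"],
                        "B": np.random.randn(7)})

def describe_group(df, row):
    # Return the result; assigning to the local `df` would discard it.
    return df.groupby("A").get_group(row).describe()

# Collect the partial frames in a list, then concat once.
groups = ["x", "y", "z"]
pieces = [describe_group(rand_df, g) for g in groups]
result = pd.concat(pieces, keys=groups)
print(result)
```

Repeated `pd.concat` inside a loop copies the accumulated data each time; the list-then-concat pattern is both simpler and faster.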

Pandas: need to count the number of values of a column between 0 and 0.001 then 0.001 and 0.002 etc

随声附和 · Submitted on 2019-12-12 06:56:43
Question: My code so far looks like this:

conn = psycopg2.connect("dbname=monty user=postgres host=localhost password=postgres")
cur = conn.cursor()
cur.execute("SELECT * FROM binance.zrxeth_ob_indicators;")
row = cur.fetchall()
df = pd.DataFrame(row, columns=['timestamp', 'topAsk', 'topBid', 'CPA', 'midprice', 'CPB', 'spread', 'CPA%', 'CPB%'])

ranges = (0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4)
all_observations = df['CPA%'].groupby(pd.cut(df['CPA%'], ranges)).count()

I can count them for a
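The binning itself can be done without the database round-trip; the sketch below uses made-up `CPA%` values in place of the query result. `pd.cut` labels each value with its interval, and `value_counts(sort=False)` keeps the bins in ascending order:

```python
import pandas as pd
import numpy as np

# Hypothetical CPA% column standing in for the Postgres query result.
df = pd.DataFrame({"CPA%": [0.01, 0.03, 0.07, 0.12, 0.12, 0.33, 0.39]})

# Bin edges every 0.05 from 0 to 0.4 (8 half-open bins).
ranges = np.arange(0, 0.45, 0.05)
counts = pd.cut(df["CPA%"], ranges).value_counts(sort=False)
print(counts)
```

By default `pd.cut` uses right-closed intervals, so a value of exactly 0.05 lands in (0, 0.05]; pass `right=False` if the opposite convention is wanted.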

Group by one column and show the availability of specific values from another column

有些话、适合烂在心里 · Submitted on 2019-12-12 04:59:35
Question: I have this dataframe:

df1:
     drug_id illness
   lexapro.1      HD
   lexapro.1      MS
   lexapro.2    HDED
   lexapro.2      MS
   lexapro.2      MS
   lexapro.3      CD
   lexapro.3   Sweat
   lexapro.4      HD
   lexapro.5      WD
   lexapro.5      FN

I want to first group the data by drug_id and search for the availability of HD, MS, and FN in the illness column, then fill in a second data frame like this:

df2:
     drug_id  HD  MS  FN
   lexapro.1   1   1   0
   lexapro.2   0   1   0
   lexapro.3   0   0   0
   lexapro.4   1   0   0
   lexapro.5   0   0   1

This is my code for grouping:

df1.groupby('drug
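One way to sketch this without an explicit groupby loop: `pd.crosstab` counts illness occurrences per drug, `clip(upper=1)` turns the counts into 0/1 availability flags, and `reindex` keeps only the columns of interest.

```python
import pandas as pd

df1 = pd.DataFrame({
    "drug_id": ["lexapro.1", "lexapro.1", "lexapro.2", "lexapro.2",
                "lexapro.2", "lexapro.3", "lexapro.3", "lexapro.4",
                "lexapro.5", "lexapro.5"],
    "illness": ["HD", "MS", "HDED", "MS", "MS",
                "CD", "Sweat", "HD", "WD", "FN"],
})

# Count occurrences, cap at 1 for presence/absence, keep only HD/MS/FN.
df2 = (pd.crosstab(df1["drug_id"], df1["illness"])
         .clip(upper=1)
         .reindex(columns=["HD", "MS", "FN"], fill_value=0)
         .reset_index())
print(df2)
```

Note that exact string matching is assumed, so "HDED" for lexapro.2 does not count as "HD", matching the expected output.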

Calculate the percentage increase or decrease based on the previous column value of the same row in pandas dataframe

瘦欲@ · Submitted on 2019-12-11 19:39:03
Question: My dataframe has 20 columns and multiple rows. I want to calculate the percentage increase or decrease based on the previous column's value in the same row; if a previous value is not available (in the first column) I want 100 in that place. I have tried the shift(-1) method of pandas but it's not working.

Dataframe:

    A    B    C    D    E    F
   10   20   25   50  150  100
  100  130  195  150  250  250

Expected:

    A    B    C    D    E    F
  100  100   25  100  200  -33
  100   30   50  -23   66    0

Answer 1: I suppose you can use shift(axis=1):

(df.diff(axis=1)
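A completed sketch along those lines: the row-wise difference divided by the row-wise shifted values gives the percent change, the first column's NaN is filled with 100 as the question asks, and the cast to int truncates toward zero to match the expected table.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 100], "B": [20, 130], "C": [25, 195],
                   "D": [50, 150], "E": [150, 250], "F": [100, 250]})

# diff(axis=1) / shift(axis=1) = change relative to the previous column
# of the same row; the first column has no predecessor, so fill with 100.
pct = (df.diff(axis=1)
         .div(df.shift(axis=1))
         .mul(100)
         .fillna(100)
         .astype(int))
print(pct)
```

`df.pct_change(axis=1)` computes the same ratio in one call if the scaling and fill are applied afterwards.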

Create a matrix from two dataframes - pandas?

↘锁芯ラ · Submitted on 2019-12-11 18:26:13
Question: I have two DataFrames. One with list columns:

df1 =
  ID  As       Hs   Ts
  A   A_1      A_6  A_7
  B   B_1
  C   C_1      C10
  D   D_1
  E   E_1,E_2  E_5  E_4
  F   F_1,F_4

and one with pair scores:

df2 =
  ID1  1    ID2  2     SCORE
  A    A_1  B    B_1   1
  A    A_6  B    B_1   0.5
  A    A_7  B    B_1   0.3
  A    A_1  C    C_1   1
  A    A_6  C    C_1   0.4
  A    A_7  C    C_1   0.3
  A    A_1  C    C_10  0.3
  A    A_6  C    C_10  0.5
  A    A_7  C    C_10  0.3
  A    A_1  D    D_1   1
  A    A_6  D    D_1   0.2
  A    A_7  D    D_1   0.3
  A    A_1  E    E_1   1
  A    A_6  E    E_1   0.5
  A    A_7  E    E_1   0.4
  A    A_1  E    E_2   0.8
  A    A_6  E    E_2   0.2
  A    A_7  E    E_2   0.5
  A    A_1  E    E_5   0.3
  A    A_6  E    E_5   0.3
  A    A_7  E    E_5   0.6
  A
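The question is cut off, but if the goal is a score matrix between the sub-IDs, `pivot_table` turns the long pair list into one. The column names `sub1`/`sub2` below are assumptions, since the headers of those columns are garbled in the original:

```python
import pandas as pd

# Hypothetical subset of the pair scores in long format.
df2 = pd.DataFrame({
    "ID1":  ["A", "A", "A", "A"],
    "sub1": ["A_1", "A_6", "A_1", "A_6"],
    "ID2":  ["B", "B", "C", "C"],
    "sub2": ["B_1", "B_1", "C_1", "C_1"],
    "SCORE": [1.0, 0.5, 1.0, 0.4],
})

# Pivot the long pair list into a sub1 x sub2 score matrix;
# pairs that never occur become NaN.
matrix = df2.pivot_table(index="sub1", columns="sub2", values="SCORE")
print(matrix)
```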

Complex Grouping of dataframe with operations and creation of new columns

强颜欢笑 · Submitted on 2019-12-11 18:25:16
Question: I have a question and was not able to find a good answer which I can apply. It seems to be more complex than I thought. This is my current dataframe:

df = [customerid, visit_number, date, purchase_amount]
     [1, 38, 01-01-2019, 40]
     [1, 39, 01-03-2019, 20]
     [2, 10, 01-02-2019, 60]
     [2, 14, 01-05-2019,  0]
     [3, 10, 01-01-2019,  5]

What I am looking for is to aggregate this table so that I end up with one row per customer, plus additional columns derived from the originals, like this:

df_new =

pandas get 30 day rolling window over n years

放肆的年华 · Submitted on 2019-12-11 17:51:57
Question: I'm trying to grab a 30-day window going backwards from every date in a dataframe, but also look at the same 30-day window across all of the years in the dataset. The dates run from 2000 to 2019. For example, starting on 1st Feb 2000, I would like to grab the previous 30 days, and the 30 days before 1st Feb in every other year. I can get a rolling window to work over n days for a z-score:

dt = pd.date_range(start='2000-01-01', end='2019-03-01')
x = [randint(0,100) for x in range(len(dt))]
DTX = pd
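One way to sketch the cross-year part of this, building on the question's own random data: compute a backward-looking 30-day rolling statistic first, then pool the rolled values across years by grouping on (month, day). This is an assumed interpretation of "the same 30-day window across all years", not the asker's final code.

```python
import pandas as pd
from random import randint

dt = pd.date_range(start="2000-01-01", end="2019-03-01")
x = [randint(0, 100) for _ in range(len(dt))]
s = pd.Series(x, index=dt)

# Backward-looking 30-day mean at every date (time-based window).
roll = s.rolling("30D").mean()

# Pool the same calendar date across all years: e.g. the entry at
# (2, 1) averages the 30-day windows ending on 1 Feb of every year.
cross_year = roll.groupby([roll.index.month, roll.index.day]).mean()
print(cross_year.loc[(2, 1)])
```

A time-based window like `"30D"` counts calendar days rather than rows, so it stays correct even if some dates are missing from the index.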