comparing values in two pandas dataframes to keep a running count

*爱你&永不变心* submitted on 2021-02-11 13:41:00

Question


My apologies for the length of this but I want to explain as fully as possible. I am completely stumped on how to solve this.

The Setup:

I have two dataframes. The first, which I'll call df_01, has a list of all possible values in its first column, with no duplicates. These are all the values that can appear on any list. Every additional column represents an independent list, and each cell holds the number of days the given value has been on that list. This dataframe (df_01) has a shape of (9277, 32). These dimensions can change but will generally stay the same. The following is a small example of what it looks like.

df_01 before any actions:

index   values   list01   list02  ... list30   list31
  0       aaa      5         1    ...   NaN      83
  1       bbb     NaN       NaN   ...   NaN      4
  2       ccc      20       NaN   ...   NaN      32
  3       ddd      1         27   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  9274    qqq     NaN        15   ...   NaN      6 
  9275    rrr     238       NaN   ...   NaN     102
  9276    sss      3         2    ...   NaN     NaN
  9277    ttt      12       NaN   ...   NaN      99

This first dataframe (df_01) will always be the values as they were the previous day.

The second dataframe, which I'll call df_02, will always have fewer rows, and its length changes from day to day, but it always has the same number of columns as df_01. It currently has a shape of (1351, 32). In df_02 the first column holds all the values that appear on any list as of today, with no duplicates. The other columns hold a 1 if the value is on that list today and NaN if it is not. Here's an example.

df_02 before any actions:

index   values   list01   list02  ... list30   list31
  0       aaa      1         1    ...   NaN      1
  1       bbb     NaN        1    ...    1       1
  2       ddd      1         1    ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  1349    qqq     NaN       NaN   ...    1       1 
  1350    rrr      1        NaN   ...    1      NaN
  1351    sss     NaN        1    ...   NaN      1

The Question:

What I want to accomplish is as follows.

1) For every value in each column, if the value exists in the first dataframe (df_01) and not in the second (df_02), its counter in df_01 is reset to NaN on a per-column basis.

2) Then, for every value in each column of the second dataframe (df_02), if the value exists in the same column of both dataframes, sum the two values.

3) For example, if aaa on list01 of df_01 is 5 and aaa on list01 of df_02 is 1, then aaa on list01 of df_02 becomes 6. This keeps a running count.

4) If the value is NaN in both, no action is needed.

5) If a value is NaN in df_01 and 1 in df_02, it stays 1.
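
To make the five rules concrete, here is a minimal sketch of how a single cell would be handled. The function name `step` and its signature are hypothetical and only for illustration; NaN stands for "not on the list".

```python
import math

def step(prev, today):
    """Return (new df_01 cell, new df_02 cell) for one value on one list.

    prev  -- yesterday's counter from df_01 (a number or NaN)
    today -- today's flag from df_02 (1 or NaN)
    """
    if math.isnan(today):
        # Rules 1 and 4: not listed today, so the df_01 counter resets
        # (or stays NaN) and the df_02 cell stays NaN.
        return math.nan, math.nan
    if math.isnan(prev):
        # Rule 5: first day on the list, the df_02 cell stays 1.
        return prev, today
    # Rules 2 and 3: listed on both days, df_02 carries the running count.
    return prev, prev + today
```

With the numbers from rule 3, `step(5, 1)` leaves 5 in df_01 and puts 6 in df_02.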

*** The value in df_02 will always be a 1 or a NaN prior to the summation. It is a binary choice: either the value in the values column is on the individual list that day or it is not.

*** Notice that the values ccc and ttt are in df_01, as they are possible values, but not in df_02, as they were on none of the lists today.

*** The asterisks around values like *NaN* or *6* denote the values that will change; they would not actually be in the data.

The dataframes should look like this after the procedure:

df_01

index   values   list01   list02  ... list30   list31
  0       aaa      5         1    ...   NaN      83
  1       bbb     NaN       NaN   ...   NaN      4
  2       ccc     NaN       NaN   ...   NaN     NaN
  3       ddd      1         27   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  9274    qqq     NaN      *NaN*  ...   NaN      6 
  9275    rrr     238       NaN   ...   NaN     *NaN*
  9276    sss    *NaN*       2    ...   NaN      24
  9277    ttt      12       NaN   ...   NaN      99

df_02

index   values   list01   list02  ... list30   list31
  0       aaa     *6*       *2*   ...   NaN     *84*
  1       bbb     NaN        1    ...    1      *5*
  2       ddd     *2*      *28*   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  1349    qqq     NaN       NaN   ...    1      *7*
  1350    rrr    *239*      NaN   ...    1      NaN
  1351    sss     NaN       *3*   ...   NaN    *25*

How would I go about accomplishing something like this? I don't even know where to begin. Any ideas, even if not completely working, just to point me in the right direction would be appreciated. Please let me know if anything needs clarification.

Thanks


Answer 1:


import numpy as np

df1 = df1.set_index('values')
df2 = df2.set_index('values')

cols = [*df1.columns]
for col in cols:
    # Update to df1: reset the counter wherever df2 has NaN for that value
    reset = df2.index[df2[col].isnull()].intersection(df1.index)
    df1.loc[reset, col] = np.nan

    # Update to df2: sum if they both have numbers
    prev = df1[col].reindex(df2.index)
    both = df2[col].notnull() & prev.notnull()
    df2.loc[both, col] = df2.loc[both, col] + prev[both]

This should do what you want. We loop over each column and update it individually. Make sure the cols list contains the correct columns based on your dataframes.

The reset on df1 is done with a direct .loc assignment rather than Series.update, because update skips NaN values in the series passed to it and so cannot overwrite a number with NaN. With update, you would have to fill the NaN with a placeholder such as '-' first and then replace the placeholder with NaN afterwards.
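
As an alternative to the column loop, the same rules can probably be expressed with whole-frame operations. The following is only a sketch: the two miniature frames are rebuilt from a few rows shown in the question, the names `today`, `prev`, `df1_new` and `df2_new` are made up for illustration, and it assumes that a value missing from df2 altogether (such as ccc) counts as "not on any list today".

```python
import numpy as np
import pandas as pd

# Miniature frames taken from the rows visible in the question (illustrative only).
df1 = pd.DataFrame(
    {"values": ["aaa", "bbb", "ccc", "ddd"],
     "list01": [5, np.nan, 20, 1],
     "list02": [1, np.nan, np.nan, 27]}
).set_index("values")
df2 = pd.DataFrame(
    {"values": ["aaa", "bbb", "ddd"],
     "list01": [1, np.nan, 1],
     "list02": [1, 1, 1]}
).set_index("values")

# Align today's flags to yesterday's index; values absent today become NaN.
today = df2.reindex(index=df1.index, columns=df1.columns)

# Rule 1: keep yesterday's counter only where the value is listed today.
df1_new = df1.where(today.notna())

# Rules 2-5: add yesterday's count (0 if there was none) to today's flag.
# NaN + number is NaN, so cells that are not listed today stay NaN.
prev = df1.reindex(index=df2.index, columns=df2.columns)
df2_new = df2 + prev.fillna(0)

print(df1_new)
print(df2_new)
```

Building new frames instead of modifying df1 and df2 in place keeps the previous day's data available for checking. Whether values absent from df2 should be reset in df1 depends on how rule 1 is read, so that assumption is worth verifying against real data.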



Source: https://stackoverflow.com/questions/60442425/comparing-values-in-two-pandas-dataframes-to-keep-a-running-count
