Find Max Frequency for every Sequence_ID

Posted by 无人久伴 on 2020-01-16 09:49:11

Question


I have a DataFrame like this:

Time         Frq_1   Seq_1       Frq_2   Seq_2       Frq_3   Seq_3
12:43:04     -       30,668      -       30,670      4,620   30,671 
12:46:05     -       30,699      -       30,699      3,280   30,700 
12:46:17     4,200   30,700      -       30,704      -       30,704 
12:46:18     3,060   30,700      4,200   30,700      -       30,700 
12:46:18     3,060   30,700      4,200   30,700      -       30,700 
12:46:19     3,060   30,700      4,220   30,700      -       30,700 
12:46:20     3,060   30,700      4,240   30,700      -       30,700 
12:46:37     -       30,698      -       30,699      3,060   30,700 
12:46:38     -       30,699      3,060   30,700      4,600   30,700 
12:47:19     -       30,668      -       30,669      -       30,669 
12:47:20     -       30,667      -       30,667      -       30,668 
12:47:20     -       30,667      -       30,667      -       30,668 
12:47:21     -       30,667      -       30,667      -       30,668 
12:47:21     -       30,665      -       30,665      -       30,665 
12:47:22     -       30,665      -       30,665      -       30,665 
12:48:35     -       30,688      -       30,690      3,020   30,690 
12:49:29     4,160   30,690      -       30,691      -       30,693 

I want to scan the whole DataFrame and find the rows that meet the following conditions:

  1. Sequence_ID for which Frequency is not null
  2. Sequence_ID for which Frequency is the maximum (when a Sequence_ID appears several times with a non-null Frequency)

I want my result as below:

Time         Frequency      Sequence_ID
12:43:04     4,620          30,671 
12:46:18     4,200          30,700 
12:49:29     4,160          30,690 

Time = the Time of the row in which that (Sequence_ID, Frequency) pair occurs


Answer 1:


This turned out to be quite involved. Here we go anyway:

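import pandas as pd

# Reshape wide -> long: one row per (original row, column-group number j)
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
# Treat ',' as the decimal mark and '-' as a missing value
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',','.')
                                .replace('-',float('nan')))
long_df.reset_index(drop=True, inplace=True)
# Row label of the highest frequency within each sequence
ix = long_df.groupby('Seq_').Frq_.idxmax()

print(long_df.loc[ix[ix.notna()].values.astype(int)])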

     Time      Seq_   Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60

Note that for the sequence 30,700 the highest frequency in the data is 4.60 (at 12:46:38), not 4.20 as in the expected output.


The first step is to collapse the dataframe into three columns: one for the Time, one for the sequence and one for the frequency. We can use pd.wide_to_long with the stubnames ['Seq_', 'Frq_']:

long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')

print(long_df)

            Time    Seq_   Frq_
index j                         
0     1  12:43:04  30,668      -
1     1  12:46:05  30,699      -
2     1  12:46:17  30,700  4,200
3     1  12:46:18  30,700  3,060
4     1  12:46:18  30,700  3,060
5     1  12:46:19  30,700  3,060
6     1  12:46:20  30,700  3,060
7     1  12:46:37  30,698      -
8     1  12:46:38  30,699      -
9     1  12:47:19  30,668      -
10    1  12:47:20  30,667      -
11    1  12:47:20  30,667      -
12    1  12:47:21  30,667      -
13    1  12:47:21  30,665      -
14    1  12:47:22  30,665      -
15    1  12:48:35  30,688      -
16    1  12:49:29  30,690  4,160
...
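To see the reshape in isolation, here is a minimal sketch on two rows copied from the question's table (the 12:43:04 and 12:46:38 rows). The variable names are only illustrative; the `wide_to_long` call is the one used above, with the suffix written as a raw string.

```python
import pandas as pd

# Two rows taken from the question's table, kept as strings like the original
df = pd.DataFrame({
    "Time":  ["12:43:04", "12:46:38"],
    "Frq_1": ["-", "-"],
    "Seq_1": ["30,668", "30,699"],
    "Frq_2": ["-", "3,060"],
    "Seq_2": ["30,670", "30,700"],
    "Frq_3": ["4,620", "4,600"],
    "Seq_3": ["30,671", "30,700"],
})

# Each (Seq_N, Frq_N) pair becomes its own row; N ends up in index level "j"
long_df = pd.wide_to_long(df.reset_index(), stubnames=["Seq_", "Frq_"],
                          suffix=r"\d+", i="index", j="j")
print(long_df)
```

Two input rows with three column pairs each give six rows in the long frame, indexed by the original row number and the pair number `j`.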

The next step is to cast the frequencies to float, to be able to find the maximum values:

long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',','.')
                                    .replace('-',float('nan')))

print(long_df)

          Time    Seq_  Frq_
index j                        
0     1  12:43:04  30,668   NaN
1     1  12:46:05  30,699   NaN
2     1  12:46:17  30,700  4.20
3     1  12:46:18  30,700  3.06
4     1  12:46:18  30,700  3.06
5     1  12:46:19  30,700  3.06
6     1  12:46:20  30,700  3.06
7     1  12:46:37  30,698   NaN
... 
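The conversion above reads the comma as a decimal mark, so "4,200" becomes 4.2. Whether that matches the data is an assumption; if the comma is a thousands separator instead, it would be removed rather than replaced. A small sketch of both readings, using `errors="coerce"` as an alternative way to turn "-" into NaN:

```python
import pandas as pd

# Sample frequency strings in the question's format
frq = pd.Series(["-", "4,200", "3,060"])

# Decimal-comma reading, as in the answer: "4,200" -> 4.2; "-" is coerced to NaN
as_decimal = pd.to_numeric(frq.str.replace(",", ".", regex=False),
                           errors="coerce")

# Thousands-separator reading (an assumption about the data): "4,200" -> 4200.0
as_thousands = pd.to_numeric(frq.str.replace(",", "", regex=False),
                             errors="coerce")

print(as_decimal.tolist())
print(as_thousands.tolist())
```

For the sample in the question every frequency has the same `d,ddd` shape, so both readings rank the values in the same order and the choice does not change which rows are selected.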

Then we can group by Seq_ and find the index of the highest frequency in each group. One could also think of using max, but that returns only the values and loses the Time column.

long_df.reset_index(drop=True, inplace=True)
ix = long_df.groupby('Seq_').Frq_.idxmax()
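The difference between max and idxmax can be sketched on a few already-converted rows from the question's data. The frame below is a hand-built illustration, not the full long_df:

```python
import pandas as pd

# (Time, Seq_, Frq_) triples that occur in the question's table
long_df = pd.DataFrame({
    "Time": ["12:46:17", "12:46:18", "12:46:38", "12:49:29"],
    "Seq_": ["30,700", "30,700", "30,700", "30,690"],
    "Frq_": [4.20, 3.06, 4.60, 4.16],
})

# max() gives one value per sequence, with no link back to Time
by_max = long_df.groupby("Seq_").Frq_.max()

# idxmax() gives the row label of each maximum, so the full row can be recovered
ix = long_df.groupby("Seq_").Frq_.idxmax()
rows = long_df.loc[ix]

print(by_max)
print(rows)
```

Because idxmax returns row labels rather than values, `long_df.loc[ix]` brings back the Time of each maximum along with the frequency.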

And finally, select those rows, skipping the sequences whose frequencies are all NaN:

print(long_df.loc[ix[ix.notna()].values.astype(int)])

     Time      Seq_   Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60
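As a variation (not the answer's exact code), the missing frequencies can be dropped before grouping, so that idxmax never sees an all-NaN group and the `notna()` / `astype(int)` step is not needed. The sketch below runs the whole pipeline on four rows copied from the question's table, so its row labels differ from the full output above:

```python
import pandas as pd

# Four rows from the question's table: 12:43:04, 12:46:17, 12:46:38, 12:49:29
df = pd.DataFrame({
    "Time":  ["12:43:04", "12:46:17", "12:46:38", "12:49:29"],
    "Frq_1": ["-", "4,200", "-", "4,160"],
    "Seq_1": ["30,668", "30,700", "30,699", "30,690"],
    "Frq_2": ["-", "-", "3,060", "-"],
    "Seq_2": ["30,670", "30,704", "30,700", "30,691"],
    "Frq_3": ["4,620", "-", "4,600", "-"],
    "Seq_3": ["30,671", "30,704", "30,700", "30,693"],
})

# Reshape to long form and flatten the index
long_df = pd.wide_to_long(df.reset_index(), stubnames=["Seq_", "Frq_"],
                          suffix=r"\d+", i="index", j="j").reset_index(drop=True)

# Decimal-comma conversion as in the answer; "-" is coerced to NaN
long_df["Frq_"] = pd.to_numeric(
    long_df["Frq_"].str.replace(",", ".", regex=False), errors="coerce")

# Keep only rows with a frequency, then pick the maximum per sequence
valid = long_df.dropna(subset=["Frq_"])
result = valid.loc[valid.groupby("Seq_")["Frq_"].idxmax()]
print(result)
```

On this subset the selected rows agree with the output above: 30,671 at 12:43:04, 30,690 at 12:49:29 and 30,700 at 12:46:38.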


Source: https://stackoverflow.com/questions/58102325/find-max-frequency-for-every-sequence-id
