How to get frequency count of column values for each unique pair of columns in pandas?

Submitted by 我与影子孤独终老i on 2021-02-11 06:52:27

Question


I have a DataFrame that looks like the one below.

import datetime

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), u'12.100.90.10', u'100.100.12.1', u'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), u'12.100.90.10', u'100.100.12.1', u'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), u'10.100.80.10', u'10.55.10.1', u'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), u'10.10.80.10', u'10.55.10.1', u'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), u'10.10.80.10', u'10.55.10.1', u'PL_DOWN')]

As you can see, data is logged every minute. I have only presented part of the complete data here.

This is how it looks after loading it into pandas:

                        date         start           end     type
0 2021-02-10 07:49:07.118658  12.100.90.10  100.100.12.1  LT_DOWN
1 2021-02-10 07:49:14.312273  12.100.90.10  100.100.12.1    LT_UP
2 2021-02-10 07:49:21.535932  12.100.90.10  100.100.12.1    LT_UP
3 2021-02-10 07:50:28.725961  12.100.90.10  100.100.12.1  PL_DOWN
4 2021-02-10 07:50:32.450853  10.100.80.10    10.55.10.1   PL_LOW
5 2021-02-10 07:51:32.450853   10.10.80.10    10.55.10.1  MA_HIGH
6 2021-02-10 07:52:34.264042   10.10.80.10    10.55.10.1  PL_DOWN
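For completeness, the frame above can be reconstructed like this (a minimal sketch using the `data` list from the question; the column names are assumed from the printed output):

```python
import datetime

import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]

# Build the frame with the column names shown in the printed output
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])
print(df)
```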

First, I want to get the count of each value in the type column on a per-minute basis (for the type values, only the first part of the `_` split should be considered for the count). So it would look something like:

          date     LT PL  MA
0 2021-02-10 07:49 3  0   0
1 2021-02-10 07:50 0  2   0
2 2021-02-10 07:51 0  0   1
3 2021-02-10 07:52 0  1   0
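One way to produce this per-minute count is a single crosstab (a sketch, assuming the `df` built from the question's data; `dt.floor('min')` buckets the timestamps to the minute and `str.split('_').str[0]` keeps only the prefix before the underscore — note the columns come out in alphabetical order, LT, MA, PL):

```python
import datetime

import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

per_minute = pd.crosstab(
    df['date'].dt.floor('min'),        # bucket timestamps to the minute
    df['type'].str.split('_').str[0],  # keep only the part before '_'
)
print(per_minute)
```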

But the above data doesn't tell me, for every unique pair of start and end column values, what the count is for LT, PL and MA (after splitting on `_`).

Thanks to @Sayandip Dutta, who provided the solution below (https://stackoverflow.com/a/66136108/5550284):

pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False)
    ],
    dropna=False
)

Here is what the resulting DataFrame looks like:

start-end           10.10.80.10-10.55.10.1       10.100.80.10-10.55.10.1       12.100.90.10-100.100.12.1      
type                                    LT MA PL                      LT MA PL                        LT MA PL
date                                                                                                          
2021-02-10 07:49:00                      0  0  0                       0  0  0                         3  0  0
2021-02-10 07:50:00                      0  0  0                       0  0  1                         0  0  1
2021-02-10 07:51:00                      0  1  0                       0  0  0                         0  0  0
2021-02-10 07:52:00                      0  0  1                       0  0  0                         0  0  0

Converting the above to boolean, it looks like this:

start-end           10.10.80.10-10.55.10.1     10.100.80.10-10.55.10.1    12.100.90.10-100.100.12.1
type                    LT     MA     PL           LT     MA     PL            LT     MA     PL
date
2021-02-10 07:49:00  False  False  False        False  False  False         True  False  False
2021-02-10 07:50:00  False  False  False        False  False   True        False  False   True
2021-02-10 07:51:00  False   True  False        False  False  False        False  False  False
2021-02-10 07:52:00  False  False   True        False  False  False        False  False  False
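The boolean view above is just the crosstab cast with `astype(bool)` — a sketch assuming the `df` built from the question's data (any non-zero count becomes `True`):

```python
import datetime

import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

# The crosstab from the linked answer
ct = pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False)
    ],
    dropna=False,
)

# Counts -> booleans: 0 becomes False, anything positive becomes True
bool_ct = ct.astype(bool)
print(bool_ct)
```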

Now I want to know, for every unique pair of start and end, the total count of True values for LT, MA and PL. So my final DataFrame should look like:

start         end           LT  MA  PL
10.10.80.10   10.55.10.1    0   1   1
10.100.80.10  10.55.10.1    0   0   1
12.100.90.10  100.100.12.1  1   0   1

I just can't seem to figure out how to extract the required information from the crosstab.


Answer 1:


You can use numpy.sign with pandas.crosstab:

>>> import numpy as np
>>> np.sign(
...     pd.crosstab(
...         index=[df['start'], df['end']],
...         columns=df['type'].str.extract(r'(\w+)_', expand=False)
...     )
... )

type                       LT  MA  PL
start        end                     
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1

Or, instead of np.sign you can use pandas.DataFrame.clip:

>>> pd.crosstab(
...     index=[df['start'], df['end']],
...     columns=df['type'].str.extract(r'(\w+)_', expand=False)
... ).clip(upper=1)

type                       LT  MA  PL
start        end                     
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1

EDIT:

pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'],
        df['end'],
        df['type'].str.extract(r'(\w+)_', expand=False)
    ],
).astype(bool).sum().unstack(-1, fill_value=0)

type                       LT  MA  PL
start        end                     
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1
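Putting the answer together as a runnable sketch (using the question's data): `np.sign` and `clip(upper=1)` give identical results, since both map any positive count to 1 and leave 0 as 0.

```python
import datetime

import numpy as np
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

# Raw counts per (start, end) pair and type prefix
ct = pd.crosstab(
    index=[df['start'], df['end']],
    columns=df['type'].str.extract(r'(\w+)_', expand=False),
)

via_sign = np.sign(ct)        # positive count -> 1, zero -> 0
via_clip = ct.clip(upper=1)   # cap counts at 1
print(via_sign)
```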


Source: https://stackoverflow.com/questions/66141707/how-to-get-frequency-count-of-column-values-for-each-unique-pair-of-columns-in-p
