Question
I have a DataFrame built from the data below:
import datetime
data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), u'12.100.90.10', u'100.100.12.1', u'LT_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), u'12.100.90.10', u'100.100.12.1', u'PL_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), u'10.100.80.10', u'10.55.10.1', u'PL_LOW'),
(datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), u'10.10.80.10', u'10.55.10.1', u'MA_HIGH'),
(datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), u'10.10.80.10', u'10.55.10.1', u'PL_DOWN')]
As you can see, data is logged every minute. I have shown only part of the complete data here.
This is how it looks when loaded into pandas:
                        date         start           end     type
0 2021-02-10 07:49:07.118658  12.100.90.10  100.100.12.1  LT_DOWN
1 2021-02-10 07:49:14.312273  12.100.90.10  100.100.12.1    LT_UP
2 2021-02-10 07:49:21.535932  12.100.90.10  100.100.12.1    LT_UP
3 2021-02-10 07:50:28.725961  12.100.90.10  100.100.12.1  PL_DOWN
4 2021-02-10 07:50:32.450853  10.100.80.10    10.55.10.1   PL_LOW
5 2021-02-10 07:51:32.450853   10.10.80.10    10.55.10.1  MA_HIGH
6 2021-02-10 07:52:34.264042   10.10.80.10    10.55.10.1  PL_DOWN
First, I want to get the count of each value in the type column on a per-minute basis (for values in the type column, only the part before the _ should be considered for the count). So it would look something like this:
               date  LT  PL  MA
0  2021-02-10 07:49   3   0   0
1  2021-02-10 07:50   0   2   0
2  2021-02-10 07:51   0   0   1
3  2021-02-10 07:52   0   1   0
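A minimal sketch of one way to produce the per-minute table above, assuming the sample rows are loaded into a DataFrame named df with columns date, start, end and type. The names prefix and per_minute are illustrative and not from the original post.

```python
import pandas as pd

# Sample rows from the question, with timestamps written as strings.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-02-10 07:49:07.118658', '2021-02-10 07:49:14.312273',
        '2021-02-10 07:49:21.535932', '2021-02-10 07:50:28.725961',
        '2021-02-10 07:50:32.450853', '2021-02-10 07:51:32.450853',
        '2021-02-10 07:52:34.264042',
    ]),
    'start': ['12.100.90.10'] * 4 + ['10.100.80.10'] + ['10.10.80.10'] * 2,
    'end': ['100.100.12.1'] * 4 + ['10.55.10.1'] * 3,
    'type': ['LT_DOWN', 'LT_UP', 'LT_UP', 'PL_DOWN', 'PL_LOW', 'MA_HIGH', 'PL_DOWN'],
})

# Keep only the part of `type` before the underscore (LT, PL, MA).
prefix = df['type'].str.split('_').str[0]

# Rows: timestamps floored to the minute. Columns: type prefix. Cells: row counts.
per_minute = pd.crosstab(df['date'].dt.floor('1min'), prefix)
print(per_minute)
```

Note that crosstab sorts the column labels, so the columns come out as LT, MA, PL rather than in the LT, PL, MA order written in the table above.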
But the table above doesn't show, for every unique pair of start and end values, what the counts for LT, PL and MA are (after splitting on _).
@Sayandip Dutta provided the solution below (https://stackoverflow.com/a/66136108/5550284):
pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False)
    ],
    dropna=False
)
Here is what the resulting DataFrame looks like:
start-end            10.10.80.10-10.55.10.1          10.100.80.10-10.55.10.1          12.100.90.10-100.100.12.1
type                                     LT  MA  PL                       LT  MA  PL                         LT  MA  PL
date
2021-02-10 07:49:00                       0   0   0                        0   0   0                          3   0   0
2021-02-10 07:50:00                       0   0   0                        0   0   1                          0   0   1
2021-02-10 07:51:00                       0   1   0                        0   0   0                          0   0   0
2021-02-10 07:52:00                       0   0   1                        0   0   0                          0   0   0
Converting the above to boolean gives:
start-end            10.10.80.10-10.55.10.1                10.100.80.10-10.55.10.1                12.100.90.10-100.100.12.1
type                                     LT     MA     PL                       LT     MA     PL                         LT     MA     PL
date
2021-02-10 07:49:00                   False  False  False                    False  False  False                       True  False  False
2021-02-10 07:50:00                   False  False  False                    False  False   True                      False  False   True
2021-02-10 07:51:00                   False   True  False                    False  False  False                      False  False  False
2021-02-10 07:52:00                   False  False   True                    False  False  False                      False  False  False
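For reference, the boolean table above can be obtained by calling astype(bool) on the cross tab. The sketch below assumes the same sample rows as the question; the names counts and flags are illustrative only.

```python
import pandas as pd

# Sample rows from the question, with timestamps written as strings.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-02-10 07:49:07.118658', '2021-02-10 07:49:14.312273',
        '2021-02-10 07:49:21.535932', '2021-02-10 07:50:28.725961',
        '2021-02-10 07:50:32.450853', '2021-02-10 07:51:32.450853',
        '2021-02-10 07:52:34.264042',
    ]),
    'start': ['12.100.90.10'] * 4 + ['10.100.80.10'] + ['10.10.80.10'] * 2,
    'end': ['100.100.12.1'] * 4 + ['10.55.10.1'] * 3,
    'type': ['LT_DOWN', 'LT_UP', 'LT_UP', 'PL_DOWN', 'PL_LOW', 'MA_HIGH', 'PL_DOWN'],
})

# Per-minute counts for every (start-end, type prefix) column pair.
counts = pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False),
    ],
    dropna=False,
)

# Any non-zero count becomes True, zero becomes False.
flags = counts.astype(bool)
print(flags)
```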
Now I want to know, for every unique pair of start and end, the total count of True values for LT, MA and PL. So my final DataFrame should look like this:
start         end           LT  MA  PL
10.10.80.10   10.55.10.1     0   1   1
10.100.80.10  10.55.10.1     0   0   1
12.100.90.10  100.100.12.1   1   0   1
I just can't figure out how to extract the required information from the cross tab.
Answer 1:
You can use numpy.sign with pandas.crosstab:
>>> import numpy as np
>>> np.sign(
...     pd.crosstab(
...         index=[df['start'], df['end']],
...         columns=df['type'].str.extract(r'(\w+)_', expand=False)
...     )
... )
type                       LT  MA  PL
start        end
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1
Or, instead of np.sign, you can use pandas.DataFrame.clip:
>>> pd.crosstab(
...     index=[df['start'], df['end']],
...     columns=df['type'].str.extract(r'(\w+)_', expand=False)
... ).clip(upper=1)
type                       LT  MA  PL
start        end
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1
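EDIT: To get, for each pair, the sum of True values from the per-minute boolean cross tab (that is, the number of minutes in which each type occurs), build the cross tab with start, end and type as separate column levels, convert it to boolean, sum each column and unstack the type level:
>>> pd.crosstab(
...     index=df['date'].dt.floor('1min'),
...     columns=[
...         df['start'],
...         df['end'],
...         df['type'].str.extract(r'(\w+)_', expand=False)
...     ],
... ).astype(bool).sum().unstack(-1, fill_value=0)
type                       LT  MA  PL
start        end
10.10.80.10  10.55.10.1     0   1   1
10.100.80.10 10.55.10.1     0   0   1
12.100.90.10 100.100.12.1   1   0   1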
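On the sample data all three snippets print the same table, but they are not equivalent in general: np.sign and clip only flag whether a type ever occurs for a pair, while the EDIT version counts the minutes in which it occurs. The sketch below illustrates the difference on hypothetical data (not from the question) in which one pair logs LT events in two different minutes. The names presence and minutes are illustrative.

```python
import pandas as pd

# Hypothetical data: the first pair logs LT in minute 07:49 and again in 07:50.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-02-10 07:49:07', '2021-02-10 07:49:14',
        '2021-02-10 07:50:28', '2021-02-10 07:50:32',
    ]),
    'start': ['12.100.90.10', '12.100.90.10', '12.100.90.10', '10.100.80.10'],
    'end': ['100.100.12.1', '100.100.12.1', '100.100.12.1', '10.55.10.1'],
    'type': ['LT_DOWN', 'LT_UP', 'LT_UP', 'PL_LOW'],
})
prefix = df['type'].str.extract(r'(\w+)_', expand=False)

# Presence flag: 1 if the pair ever logged the type, else 0.
presence = pd.crosstab(index=[df['start'], df['end']], columns=prefix).clip(upper=1)

# Minute count: number of distinct minutes in which the pair logged the type.
minutes = (
    pd.crosstab(
        index=df['date'].dt.floor('1min'),
        columns=[df['start'], df['end'], prefix],
    )
    .astype(bool)
    .sum()
    .unstack(-1, fill_value=0)
)

pair = ('12.100.90.10', '100.100.12.1')
print(presence.loc[pair, 'LT'])  # presence flag for LT
print(minutes.loc[pair, 'LT'])   # number of minutes with an LT event
```

Here the presence flag for LT is 1 while the minute count is 2, so the EDIT version is the one that matches "total count of True" from the per-minute boolean table.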
Source: https://stackoverflow.com/questions/66141707/how-to-get-frequency-count-of-column-values-for-each-unique-pair-of-columns-in-p