Question
I have a DataFrame built from data that looks like this:
data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), u'12.100.90.10', u'100.100.12.1', u'LT_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), u'12.100.90.10', u'100.100.12.1', u'PL_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), u'10.100.80.10', u'10.55.10.1', u'PL_LOW'),
(datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), u'10.10.80.10', u'10.55.10.1', u'MA_HIGH'),
(datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), u'10.10.80.10', u'10.55.10.1', u'PL_DOWN')]
As you can see, data is being logged every minute; I have only included part of the complete data here.
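For completeness, the loading step looks something like this (a minimal sketch; the column names date, start, end and type are the ones used in the printed frame below):
import pandas as pd

df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])
df['date'] = pd.to_datetime(df['date'])  # make sure the .dt accessor works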
This is what the DataFrame looks like:
date start end type
0 2021-02-10 07:49:07.118658 12.100.90.10 100.100.12.1 LT_DOWN
1 2021-02-10 07:49:14.312273 12.100.90.10 100.100.12.1 LT_UP
2 2021-02-10 07:49:21.535932 12.100.90.10 100.100.12.1 LT_UP
3 2021-02-10 07:50:28.725961 12.100.90.10 100.100.12.1 PL_DOWN
4 2021-02-10 07:50:32.450853 10.100.80.10 10.55.10.1 PL_LOW
5 2021-02-10 07:51:32.450853 10.10.80.10 10.55.10.1 MA_HIGH
6 2021-02-10 07:52:34.264042 10.10.80.10 10.55.10.1 PL_DOWN
First, I want to get the count of each value in the type column on a per-minute basis (for the type values, only the first part of the split on _ should be counted). So it would look something like the table below; a sketch of one way to produce it follows the table.
date LT PL MA
0 2021-02-10 07:49 3 0 0
1 2021-02-10 07:50 0 2 0
2 2021-02-10 07:51 0 0 1
3 2021-02-10 07:52 0 1 0
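For what it's worth, this per-minute view can be produced with a plain crosstab of the minute-floored timestamps against the prefix before _ (a sketch, assuming df is the DataFrame shown above; column order may differ):
per_minute = pd.crosstab(
    df['date'].dt.floor('1min'),        # bucket the timestamps by minute
    df['type'].str.split('_').str[0]    # keep only the part before '_'
)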
But the above doesn't tell me, for every unique pair of start and end column values, what the count is for LT, PL and MA (after splitting on _).
Thanks to @Sayandip Dutta, who provided the solution below (https://stackoverflow.com/a/66136108/5550284):
pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False)
    ],
    dropna=False
)
Here is what the resulting DataFrame looks like:
start-end 10.10.80.10-10.55.10.1 10.100.80.10-10.55.10.1 12.100.90.10-100.100.12.1
type LT MA PL LT MA PL LT MA PL
date
2021-02-10 07:49:00 0 0 0 0 0 0 3 0 0
2021-02-10 07:50:00 0 0 0 0 0 1 0 0 1
2021-02-10 07:51:00 0 1 0 0 0 0 0 0 0
2021-02-10 07:52:00 0 0 1 0 0 0 0 0 0
Next, I convert the above crosstab to boolean.
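A minimal way to do that, assuming the crosstab above is stored in a variable named ct:
ct_bool = ct.astype(bool)  # any non-zero count becomes True
The boolean table then looks like this: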
start-end 10.10.80.10-10.55.10.1 10.100.80.10-10.55.10.1 12.100.90.10-100.100.12.1
type LT MA PL LT MA PL LT MA PL
date
2021-02-10 07:49:00 False False False False False False True False False
2021-02-10 07:50:00 False False False False False True False False True
2021-02-10 07:51:00 False True False False False False False False False
2021-02-10 07:52:00 False False True False False False False False False
Now I want to know, for every unique pair of start and end, what the total count of True is for LT, MA and PL. So my final DataFrame should look like this:
start end LT MA PL
10.10.80.10 10.55.10.1 0 1 1
10.100.80.10 10.55.10.1 0 0 1
12.100.90.10 100.100.12.1 1 0 1
I just can't seem to figure out how to extract the required information from the crosstab.
Answer 1:
You can use numpy.sign with pandas.crosstab:
>>> import numpy as np
>>> np.sign(
        pd.crosstab(
            index=[df['start'], df['end']],
            columns=df['type'].str.extract(r'(\w+)_', expand=False)
        )
    )
type LT MA PL
start end
10.10.80.10 10.55.10.1 0 1 1
10.100.80.10 10.55.10.1 0 0 1
12.100.90.10 100.100.12.1 1 0 1
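If you want start and end back as ordinary columns, as in the desired output, a small follow-up is to store the result and reset its index (out is just an assumed name for the table above):
>>> out = np.sign(
        pd.crosstab(
            index=[df['start'], df['end']],
            columns=df['type'].str.extract(r'(\w+)_', expand=False)
        )
    )
>>> out.reset_index()  # 'start' and 'end' become regular columns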
Or, instead of np.sign, you can use pandas.DataFrame.clip:
>>> pd.crosstab(
        index=[df['start'], df['end']],
        columns=df['type'].str.extract(r'(\w+)_', expand=False)
    ).clip(upper=1)
type LT MA PL
start end
10.10.80.10 10.55.10.1 0 1 1
10.100.80.10 10.55.10.1 0 0 1
12.100.90.10 100.100.12.1 1 0 1
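Either way the idea is the same: the crosstab counts how many times each type occurred for a start/end pair, and np.sign / clip(upper=1) collapse every non-zero count to 1, i.e. a plain "did this type ever occur" indicator.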
EDIT:
pd.crosstab(
    index=df['date'].dt.floor('1min'),                    # per-minute buckets
    columns=[
        df['start'],
        df['end'],
        df['type'].str.extract(r'(\w+)_', expand=False)   # prefix before '_'
    ],
).astype(bool).sum().unstack(-1, fill_value=0)             # count minutes with at least one occurrence
type LT MA PL
start end
10.10.80.10 10.55.10.1 0 1 1
10.100.80.10 10.55.10.1 0 0 1
12.100.90.10 100.100.12.1 1 0 1
Source: https://stackoverflow.com/questions/66141707/how-to-get-frequency-count-of-column-values-for-each-unique-pair-of-columns-in-p