Question
I have a df,
   keys
0  one
1  two,one
2  " "
3  five,one
4  " "
5  two,four
6  four
7  four,five
and two lists,
actual=["one","two"]
syn=["four","five"]
I am creating a new column, df["val"], that holds the category of each cell in df["keys"].
If any one of the keys in a cell is present in actual, I want actual in the new column for that row. If none of the keys is present in actual, I want the corresponding df["val"] to be syn.
Cells that contain only whitespace should be left alone.
My desired output is,
output_df
   keys       val
0  one        actual
1  two,one    actual
2  " "
3  five,one   actual
4  " "
5  two,four   actual
6  four       syn
7  four,five  syn
Please help, thanks in advance!
Answer 1:
Use numpy.select with two conditions, testing membership by comparing sets:
s = df['keys'].str.split(',')
m1 = s.apply(set) & set(actual)
m2 = s.apply(set) & set(syn)
df['part'] = np.select([m1, m2], ['actual','syn'], default='')
print (df)
   keys       part
0  one        actual
1  two,one    actual
2
3  five,one   actual
4
5  two,four   actual
6  four       syn
7  four,five  syn
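The `Series & set` expression above relies on pandas broadcasting a Python set across an object column, which may not behave the same way on every pandas version. A minimal sketch that makes the membership test explicit instead, assuming the sample data from the question (the helper name `overlaps` is hypothetical and not part of the original answer):

```python
import numpy as np
import pandas as pd

# Sample data from the question; " " marks the whitespace-only cells.
df = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                            "two,four", "four", "four,five"]})
actual = ["one", "two"]
syn = ["four", "five"]


def overlaps(cell, words):
    """Return True if any comma-separated token in `cell` is in `words`."""
    return not set(cell.split(",")).isdisjoint(words)


# Plain boolean masks, one per category.
m1 = df["keys"].apply(lambda cell: overlaps(cell, actual))
m2 = df["keys"].apply(lambda cell: overlaps(cell, syn))

# np.select uses the first condition that is True, so 'actual' wins over 'syn'.
df["part"] = np.select([m1, m2], ["actual", "syn"], default="")
print(df)
```

Because the masks are ordinary boolean Series, they can be passed to np.select or np.where without further conversion.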
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
In [143]: %%timeit
...: s = df['keys'].str.split(',')
...: m1 = s.apply(set) & set(actual)
...: m2 = s.apply(set) & set(syn)
...:
1 loop, best of 3: 160 ms per loop
# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [144]: %%timeit
...: v = df['keys'].str.split(',',expand=True)
...: m1 = v.isin(["one","two"]).any(axis=1)
...: m2 = v.isin(["four","five"]).any(axis=1)
...:
1 loop, best of 3: 193 ms per loop
Caveat:
Performance really depends on the data.
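Since the result depends on the data, one way to check on your own input is to time both approaches with the standard-library `timeit` module. A rough sketch, assuming the sample frame repeated 1,000 times; the function names `by_sets` and `by_expand` are hypothetical, the set-based version uses an explicit `isdisjoint` test rather than the `&` expression timed above, and the numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

actual = ["one", "two"]
syn = ["four", "five"]
base = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                              "two,four", "four", "four,five"]})
big = pd.concat([base] * 1000, ignore_index=True)


def by_sets(frame):
    """Build both masks with a per-row set membership test."""
    s = frame["keys"].str.split(",")
    m1 = s.apply(lambda words: not set(words).isdisjoint(actual))
    m2 = s.apply(lambda words: not set(words).isdisjoint(syn))
    return m1, m2


def by_expand(frame):
    """Build both masks from an expanded frame with isin + any."""
    v = frame["keys"].str.split(",", expand=True)
    return v.isin(actual).any(axis=1), v.isin(syn).any(axis=1)


for fn in (by_sets, by_expand):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```

Changing the repeat factor or the number of comma-separated tokens per cell is a quick way to see how the comparison shifts.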
Answer 2:
First, split on comma and expand words into their own cells.
v = df['keys'].str.split(',',expand=True)
Next, form two masks, one for actual and another for syn, using isin + any. These will be used to label the rows.
m1 = v.isin(["one","two"]).any(axis=1)
m2 = v.isin(["four","five"]).any(axis=1)
Finally, use np.select or np.where to label the rows based on the masks computed.
df['val'] = np.select([m1, m2], ['actual', 'syn'], default='')
Or,
df['val'] = np.where(m1, 'actual', np.where(m2, 'syn', ''))
df
   keys       val
0  one        actual
1  two,one    actual
2
3  five,one   actual
4
5  two,four   actual
6  four       syn
7  four,five  syn
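For convenience, the three steps can be put together into one runnable script. This is a sketch assuming the sample data from the question, with the `actual` and `syn` lists passed to isin in place of the literal lists, and with `axis=1` spelled out as a keyword:

```python
import numpy as np
import pandas as pd

# Sample data from the question; " " marks the whitespace-only cells.
df = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                            "two,four", "four", "four,five"]})
actual = ["one", "two"]
syn = ["four", "five"]

# Step 1: one column per comma-separated word (missing slots become None).
v = df["keys"].str.split(",", expand=True)

# Step 2: a row is flagged if any of its words is in the given list.
m1 = v.isin(actual).any(axis=1)
m2 = v.isin(syn).any(axis=1)

# Step 3: label the rows; m1 is checked first, so it takes priority.
df["val"] = np.select([m1, m2], ["actual", "syn"], default="")

# Equivalent labelling with nested np.where.
alt = np.where(m1, "actual", np.where(m2, "syn", ""))
print(df)
```

Both labelling calls give the same values here; np.select tends to read more easily once there are more than two categories.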
Details
v
      0     1
0   one  None
1   two   one
2        None
3  five   one
4        None
5   two  four
6  four  None
7  four  five
m1
0 True
1 True
2 False
3 True
4 False
5 True
6 False
7 False
dtype: bool
m2
0 False
1 False
2 False
3 True
4 False
5 True
6 True
7 True
dtype: bool
np.select([m1, m2], ['actual', 'syn'], default='')
array(['actual', 'actual', '', 'actual', '', 'actual', 'syn', 'syn'],
dtype='<U6')
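Rows 3 and 5 are True in both masks, yet they are labelled 'actual'. That follows from how np.select resolves overlaps: when several conditions hold, the first one in the list decides the result. A small sketch, using the mask values printed above as plain NumPy arrays, that shows the effect of swapping the order:

```python
import numpy as np

# Mask values copied from the m1 and m2 output above.
m1 = np.array([True, True, False, True, False, True, False, False])
m2 = np.array([False, False, False, True, False, True, True, True])

# 'actual' listed first: it wins wherever both masks are True.
actual_first = np.select([m1, m2], ["actual", "syn"], default="")

# 'syn' listed first: rows 3 and 5 flip to 'syn'.
syn_first = np.select([m2, m1], ["syn", "actual"], default="")

print(actual_first.tolist())
print(syn_first.tolist())
```

The order of the condition list is therefore what encodes the rule that actual takes priority over syn.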
Source: https://stackoverflow.com/questions/48380425/how-to-split-and-categorize-value-in-a-column-of-a-pandas-dataframe