How to split and categorize values in a column of a pandas DataFrame

Question

I have a df,

    keys
0   one
1   two,one
2   " "
3   five,one
4   " "
5   two,four
6   four
7   four,five

and two lists,

 actual=["one","two"]
 syn=["four","five"]

I am creating a new column, df["val"], that should hold the category of each cell in df["keys"]. If any of the keys in a cell is present in actual, I want "actual" in the new column on the same row. If none of the keys is present in actual, I want the corresponding df["val"] to be "syn". Cells that contain only whitespace should be left untouched.

My desired output is,

output_df

    keys      val
0   one       actual
1   two,one   actual
2   " "        
3   five,one  actual
4   " "
5   two,four  actual
6   four      syn
7   four,five syn

Please help, thanks in advance!

Answer 1:

Use numpy.select with two conditions, checking membership by intersecting sets:

s = df['keys'].str.split(',')
m1 = s.apply(set) & set(actual)
m2 = s.apply(set) & set(syn)

df['part'] = np.select([m1, m2], ['actual','syn'], default='')
print (df)
        keys    part
0        one  actual
1    two,one  actual
2                   
3   five,one  actual
4                   
5   two,four  actual
6       four     syn
7  four,five     syn
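
The snippet above relies on `&` between a Series of sets and a plain set, and how pandas handles that on object columns may vary between versions. As a hedged alternative, here is a minimal self-contained sketch that builds explicit boolean masks with `set.isdisjoint`. The DataFrame is rebuilt from the question's sample data, and the column is named val as in the question; the names `tokens`, `actual_set` and `syn_set` are only illustrative.

```python
import numpy as np
import pandas as pd

actual = ["one", "two"]
syn = ["four", "five"]

# Sample data from the question; the blank cells hold a single space.
df = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                            "two,four", "four", "four,five"]})

# Split each cell into a list of tokens, e.g. "two,one" -> ["two", "one"].
tokens = df["keys"].str.split(",")

# Explicit boolean masks: True when the row shares at least one token with the list.
actual_set, syn_set = set(actual), set(syn)
m1 = tokens.apply(lambda row: not actual_set.isdisjoint(row))
m2 = tokens.apply(lambda row: not syn_set.isdisjoint(row))

# np.select takes the first condition that is True, so "actual" wins over "syn".
df["val"] = np.select([m1, m2], ["actual", "syn"], default="")
print(df)
```

Because the masks are real booleans, they can be inspected or combined like any other pandas mask before being passed to np.select.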

Timings:

df = pd.concat([df] * 10000, ignore_index=True)


In [143]: %%timeit 
     ...: s = df['keys'].str.split(',')
     ...: m1 = s.apply(set) & set(actual)
     ...: m2 = s.apply(set) & set(syn)
     ...: 
1 loop, best of 3: 160 ms per loop

# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [144]: %%timeit
     ...: v = df['keys'].str.split(',',expand=True)
     ...: m1 = v.isin(["one","two"]).any(axis=1)
     ...: m2 = v.isin(["four","five"]).any(axis=1)
     ...: 
1 loop, best of 3: 193 ms per loop

Caveat:

Performance really depends on the data, for example the number of rows and how many comma-separated values each cell holds.
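
To check this on your own data outside IPython, the comparison can be sketched with the standard timeit module. This is only a rough harness under stated assumptions: the function names `masks_with_sets` and `masks_with_isin` are made up for illustration, the set-based variant uses explicit boolean masks rather than the exact `&` expression timed above, and the frame is 1000 repeats of the sample data rather than 10000.

```python
import timeit

import pandas as pd

actual = ["one", "two"]
syn = ["four", "five"]

base = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                              "two,four", "four", "four,five"]})
# Assumed size for illustration; the timings above used 10000 repeats.
big = pd.concat([base] * 1000, ignore_index=True)

def masks_with_sets(frame):
    # Split into lists, then test each row for overlap with each set.
    tokens = frame["keys"].str.split(",")
    a, s = set(actual), set(syn)
    return (tokens.apply(lambda row: not a.isdisjoint(row)),
            tokens.apply(lambda row: not s.isdisjoint(row)))

def masks_with_isin(frame):
    # Expand tokens into columns, then use isin + any across each row.
    v = frame["keys"].str.split(",", expand=True)
    return v.isin(actual).any(axis=1), v.isin(syn).any(axis=1)

for fn in (masks_with_sets, masks_with_isin):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

The absolute numbers will differ by machine and pandas version, so only the relative ordering on your own data is meaningful.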

Answer 2:

First, split on comma and expand words into their own cells.

v = df['keys'].str.split(',',expand=True)

Next, form two masks, one for actual and another for syn, using isin + any. These will be used to label the rows.

m1 = v.isin(["one","two"]).any(axis=1)
m2 = v.isin(["four","five"]).any(axis=1)

Finally, use np.select or np.where to label the rows based on the masks computed.

df['val'] = np.select([m1, m2], ['actual', 'syn'], default='')

Or,

df['val'] = np.where(m1, 'actual', np.where(m2, 'syn', ''))

df

        keys     val
0        one  actual
1    two,one  actual
2                   
3   five,one  actual
4                   
5   two,four  actual
6       four     syn
7  four,five     syn
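
The three steps can also be wrapped in a small helper that takes the two lists as arguments instead of hard-coding them. This is a sketch, not part of the original answer: the function name `label_keys` is hypothetical, and `axis=1` is passed by keyword, which recent pandas versions require.

```python
import numpy as np
import pandas as pd

def label_keys(keys, actual, syn):
    """Return 'actual', 'syn' or '' for each comma-separated cell in `keys`."""
    v = keys.str.split(",", expand=True)   # one token per column, None where missing
    m1 = v.isin(actual).any(axis=1)        # row has at least one token from `actual`
    m2 = v.isin(syn).any(axis=1)           # row has at least one token from `syn`
    # The first matching condition wins, so `actual` takes priority over `syn`.
    return np.select([m1, m2], ["actual", "syn"], default="")

# Sample data from the question; the blank cells hold a single space.
df = pd.DataFrame({"keys": ["one", "two,one", " ", "five,one", " ",
                            "two,four", "four", "four,five"]})
df["val"] = label_keys(df["keys"], ["one", "two"], ["four", "five"])
print(df)
```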

Details

v

      0     1
0   one  None
1   two   one
2        None
3  five   one
4        None
5   two  four
6  four  None
7  four  five

m1

0     True
1     True
2    False
3     True
4    False
5     True
6    False
7    False
dtype: bool

m2

0    False
1    False
2    False
3     True
4    False
5     True
6     True
7     True
dtype: bool

np.select([m1, m2], ['actual', 'syn'], default='')
array(['actual', 'actual', '', 'actual', '', 'actual', 'syn', 'syn'],
      dtype='<U6')
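
Note that rows 3 and 5 are True in both masks. np.select picks the choice for the first condition that matches, which is why those rows end up as actual. The short sketch below hard-codes the two masks printed above and swaps the order to show the difference; the variable names `first_actual` and `first_syn` are only illustrative.

```python
import numpy as np

# Masks copied from the Details output above.
m1 = np.array([True, True, False, True, False, True, False, False])
m2 = np.array([False, False, False, True, False, True, True, True])

# Conditions are checked in list order; the first True one decides the label.
first_actual = np.select([m1, m2], ["actual", "syn"], default="")
first_syn = np.select([m2, m1], ["syn", "actual"], default="")

print(first_actual.tolist())
# ['actual', 'actual', '', 'actual', '', 'actual', 'syn', 'syn']
print(first_syn.tolist())
# ['actual', 'actual', '', 'syn', '', 'syn', 'syn', 'syn']
```

So the order of the masks in the condition list encodes the priority asked for in the question: actual first, syn second.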

Source: https://stackoverflow.com/questions/48380425/how-to-split-and-categorize-value-in-a-column-of-a-pandas-dataframe

Tags

python

pandas

dataframe

data-analysis