Find all matching groups in a list of lists with pandas

Question

I would like to find which cases are satisfied for each id in a Pandas DataFrame. What would be an efficient solution? I have around 10k records, and the data is processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied for an id when the names recorded for that id include every name in the case.

Input (Pandas DataFrame)

id | name |
-----------
1  | bla1 |
2  | bla2 |
2  | bla3 |
2  | bla4 |
3  | bla5 |
4  | bla9 |
5  | bla6 |
5  | bla7 |
6  | bla8 |

Cases

names [
  [bla2, bla3, bla4], #case 1
  [bla1, bla3, bla7], #case 2
  [bla3, bla1, bla6], #case 3
  [bla6, bla7] #case 4
]

Needed output (unless there is a more efficient way)

id | case1 | case2 | case3 | case4 |
------------------------------------
1  | 0     | 0     | 0     | 0     |
2  | 1     | 0     | 0     | 0     |
3  | 0     | 0     | 0     | 0     |
4  | 0     | 0     | 0     | 0     |
5  | 0     | 0     | 0     | 1     |
6  | 0     | 0     | 0     | 0     |

Answer 1:

import pandas as pd

names = [
   ['bla2', 'bla3', 'bla4'], # case 1
   ['bla1', 'bla3', 'bla7'], # case 2
   ['bla3', 'bla1', 'bla6'], # case 3
   ['bla6', 'bla7']          # case 4
]

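# df is the DataFrame from the question.
# For each id, emit 1 for a case if all of the case's names occur in that id's names, else 0.
df = (
    df.groupby('id')
      .apply(lambda x: pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))
      .rename(columns=lambda x: 'case{}'.format(x + 1))
)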

df
+------+---------+---------+---------+---------+
|   id |   case1 |   case2 |   case3 |   case4 |
|------+---------+---------+---------+---------|
|    1 |       0 |       0 |       0 |       0 |
|    2 |       1 |       0 |       0 |       0 |
|    3 |       0 |       0 |       0 |       0 |
|    4 |       0 |       0 |       0 |       0 |
|    5 |       0 |       0 |       0 |       1 |
|    6 |       0 |       0 |       0 |       0 |
+------+---------+---------+---------+---------+

First, group by id, then apply a check for each case to each group in turn. The objective is to test whether every name in a given case appears among the names in the group. This is handled by isin together with all() inside the list comprehension. The outer pd.Series expands the results into separate columns, and rename is used to give the columns their case names.
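Since the question asks about efficiency, here is a minimal sketch of a set-based variant of the same idea. It is not the method from the answer above: it assumes the input data from the question, and the names `names_by_id` and `result` are made up for illustration. Each id's names are collected into a set once, and a case matches when its names form a subset of that set.

```python
import pandas as pd

# Input data copied from the question.
df = pd.DataFrame({
    'id':   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    'name': ['bla1', 'bla2', 'bla3', 'bla4', 'bla5', 'bla9', 'bla6', 'bla7', 'bla8'],
})

names = [
    ['bla2', 'bla3', 'bla4'],  # case 1
    ['bla1', 'bla3', 'bla7'],  # case 2
    ['bla3', 'bla1', 'bla6'],  # case 3
    ['bla6', 'bla7'],          # case 4
]

# One set of names per id (a Series indexed by id); hypothetical helper name.
names_by_id = df.groupby('id')['name'].apply(set)

# A case matches an id when the case's names are a subset of that id's names.
result = pd.DataFrame(
    {
        'case{}'.format(i + 1): [int(set(case) <= s) for s in names_by_id]
        for i, case in enumerate(names)
    },
    index=names_by_id.index,
)

print(result)
```

With the question's data, only id 2 (case1) and id 5 (case4) are flagged. Building the sets once avoids constructing a new pd.Series for every case in every group, which may matter as the number of ids and cases grows, though it is worth timing both versions on real data.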

Source: https://stackoverflow.com/questions/46274903/find-all-matching-groups-in-a-list-of-lists-with-pandas

Tags

python

pandas

dataframe