Question
I would like to find, for every id in a Pandas DataFrame, which cases it satisfies. What would be an efficient solution? I have around 10k records, and the data is processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied when an id contains all of the names in that case.
Input (Pandas DataFrame)
id | name |
-----------
1 | bla1 |
2 | bla2 |
2 | bla3 |
2 | bla4 |
3 | bla5 |
4 | bla9 |
5 | bla6 |
5 | bla7 |
6 | bla8 |
Cases
names [
[bla2, bla3, bla4], #case 1
[bla1, bla3, bla7], #case 2
[bla3, bla1, bla6], #case 3
[bla6, bla7] #case 4
]
Needed output (unless there is a more efficient way)
id | case1 | case2 | case3 | case4 |
------------------------------------
1 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 |
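Answer 1:
import pandas as pd

# df is the input DataFrame from the question, with columns 'id' and 'name'
names = [
['bla2', 'bla3', 'bla4'], # case 1
['bla1', 'bla3', 'bla7'], # case 2
['bla3', 'bla1', 'bla6'], # case 3
['bla6', 'bla7'] # case 4
]
# For each id group, emit one 0/1 flag per case:
# 1 if every name in the case appears among the group's names
df = (
    df.groupby('id')
      .apply(lambda x: pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))
      .rename(columns=lambda x: 'case{}'.format(x + 1))
)
df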
+------+---------+---------+---------+---------+
| id | case1 | case2 | case3 | case4 |
|------+---------+---------+---------+---------|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 |
+------+---------+---------+---------+---------+
First, group by id, then apply a check for each case to each group. The goal is to test whether every name in a given case appears in the group; isin, in conjunction with the list comprehension, handles this. The outer pd.Series expands the result into separate columns, and rename is used to rename the columns to case1, case2, and so on.
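As a possible alternative, the same "every name in the case appears in the group" test can be written with plain Python sets, which avoids building a pd.Series per case per group. This is only a sketch, not part of the original answer: the DataFrame is rebuilt here from the question's sample input, and name_sets and result are names chosen for illustration.

```python
import pandas as pd

# Sample input from the question (rebuilt here so the snippet is self-contained)
df = pd.DataFrame({
    "id":   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    "name": ["bla1", "bla2", "bla3", "bla4", "bla5", "bla9", "bla6", "bla7", "bla8"],
})
names = [
    ["bla2", "bla3", "bla4"],  # case 1
    ["bla1", "bla3", "bla7"],  # case 2
    ["bla3", "bla1", "bla6"],  # case 3
    ["bla6", "bla7"],          # case 4
]

# Collect the names of each id into a set (groupby sorts the ids by default)
name_sets = {key: set(group) for key, group in df.groupby("id")["name"]}

# A case matches an id when the case's names are a subset of the id's names
result = pd.DataFrame(
    {
        "case{}".format(i + 1): [int(set(case) <= s) for s in name_sets.values()]
        for i, case in enumerate(names)
    },
    index=pd.Index(list(name_sets), name="id"),
)
print(result)
```

Whether this is actually faster than the groupby/apply version on 10k records would need to be measured on the real data; the main difference is that each subset test is a set operation rather than an isin call.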
Source: https://stackoverflow.com/questions/46274903/find-all-matching-groups-in-a-list-of-lists-with-pandas