Question
I have an input dataframe like the one below, where 'ID' is a unique identifier and 'Signals_in_Group' is a derived field containing the list of all unique 'Signal' values present in a 'Group'. 'Signals_Count' is also a derived field, holding the number of items in 'Signals_in_Group'.
groups_df
    ID   Timestamp Signal  Group Signals_in_Group  Signals_Count
1    5  1590662170      A      1        [A, B, C]              3
2    2  1590662169      B      1        [A, B, C]              3
3    6  1590662169      C      1        [A, B, C]              3
4    8  1590662171      D      2           [A, D]              2
5    7  1590662172      A      2           [A, D]              2
6   10  1590662185      B      3     [A, B, C, D]              4
7    9  1590662185      D      3     [A, B, C, D]              4
8    3  1590662188      C      3     [A, B, C, D]              4
9    1  1590662186      D      3     [A, B, C, D]              4
10  11  1590662189      A      3     [A, B, C, D]              4
11   4  1590662192      C      4           [C, D]              2
12  12  1590662192      D      4           [C, D]              2
13  13  1590662204      B      5           [B, C]              2
14  14  1590662204      C      5           [B, C]              2
15  15  1590662204      B      5           [B, C]              2
Below is another input, which is a list of lists
clusters = [['A', 'B'], ['B', 'C'], ['A', 'D'], ['A', 'B', 'C'], ['B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]
For each group, I need to find which of the clusters are fully contained in 'Signals_in_Group'. Then, for each cluster matched in a group, I need the first occurring 'Signal' based on 'Timestamp'. If more than one row shares the same 'Timestamp', the 'Signal' with the lowest 'ID' should be used.
Example for Group 1: 'Signals_in_Group' ([A, B, C]) contains the clusters [A, B], [B, C] and [A, B, C].
For cluster [A, B] in Group 1, the rows with index 1 and 2 match. Row 2 has the lowest 'Timestamp' of the two, so its 'Signal' value 'B' is the output.
For cluster [B, C] in Group 1, the rows with index 2 and 3 match. Both have the same timestamp, so the lowest 'ID' decides: that is 2, and the corresponding 'Signal' value 'B' is the output.
For cluster [A, B, C] in Group 1, the rows with index 1, 2 and 3 match. Rows 2 and 3 share the lowest timestamp, so the lowest 'ID' decides: that is 2, and the corresponding 'Signal' value 'B' is the output.
The same should be done for all groups.
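The selection rule for a single group can be sketched as below. This is only an illustration of the rule as described, using the Group 1 rows from the sample data; the helper name `first_signal` is made up for this sketch and is not part of the original code.

```python
import pandas as pd

# Group 1 rows copied from the sample groups_df.
group1 = pd.DataFrame({
    "ID": [5, 2, 6],
    "Timestamp": [1590662170, 1590662169, 1590662169],
    "Signal": ["A", "B", "C"],
})

def first_signal(group_rows, cluster):
    """Hypothetical helper: earliest Signal among the cluster's members,
    or None if the group does not contain every signal of the cluster."""
    if not set(cluster) <= set(group_rows["Signal"]):
        return None
    matches = group_rows[group_rows["Signal"].isin(cluster)]
    # Earliest Timestamp first; ties are broken by the lowest ID.
    return matches.sort_values(["Timestamp", "ID"])["Signal"].iloc[0]

picks = {
    ",".join(c): first_signal(group1, c)
    for c in (["A", "B"], ["B", "C"], ["A", "B", "C"], ["A", "D"])
}
print(picks)
```

Sorting by ['Timestamp', 'ID'] and taking the first row expresses both the "earliest timestamp" rule and the "lowest ID" tie-break in one step.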
The output should look like the table below: each item in 'clusters' becomes a column name, and each row is one 'Group'.
Group  A,B  B,C  A,D  A,B,C  B,C,D  A,B,C,D  C,D,E,F
1      B    B    NaN  B      NaN    NaN      NaN
2      NaN  NaN  D    NaN    NaN    NaN      NaN
3      B    B    D    B      D      D        NaN
5      NaN  B    NaN  NaN    NaN    NaN      NaN
I achieved this using the code below, first iterating over the groups and then, for each group, iterating over the clusters. However, it takes too long to complete, so I'm looking for a more Pythonic and optimized solution. I tested it on 763k rows with 52k groups and 200 clusters, and it took around 4 hours.
Any suggestion to improve the runtime would be appreciated. Thanks.
import pandas as pd

cls = [','.join(ele) for ele in clusters]
cls.insert(0, 'Group')
result = pd.DataFrame(columns=cls)
result.set_index('Group', inplace=True)
groups = groups_df['Group'].unique()
for group in groups:
    # Get all records belonging to the group
    group_df = groups_df.groupby('Group').get_group(group)
    # Skip clusters that have more items than 'Signals_in_Group'
    clusters_fil = [x for x in clusters if len(x) <= group_df['Signals_Count'].iloc[0]]
    for cluster in clusters_fil:
        # The cluster matches only if every one of its signals occurs in the group
        if all(elem in group_df['Signals_in_Group'].iloc[0] for elem in cluster):
            cluster_df = group_df[group_df['Signal'].isin(cluster)]
            # Keep the rows with the earliest Timestamp, then pick the lowest ID
            inter = cluster_df.loc[cluster_df['Timestamp'] == cluster_df['Timestamp'].min()]
            result.loc[group, ','.join(cluster)] = inter.loc[inter.ID == inter.ID.min(), 'Signal'].iat[0]
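One possible direction for speeding this up, offered as a sketch and not as a measured fix: sort the frame once by 'Timestamp' and 'ID', then loop over the clusters only (200 iterations instead of groups times clusters) and keep the first matching row per group. The sample data is rebuilt here so the block runs on its own; the names `ordered`, `signal_sets` and `columns` are introduced for this sketch, and the per-group signal sets are recomputed from 'Signal' on the assumption that they equal 'Signals_in_Group'.

```python
import pandas as pd

groups_df = pd.DataFrame({
    "ID": [5, 2, 6, 8, 7, 10, 9, 3, 1, 11, 4, 12, 13, 14, 15],
    "Timestamp": [1590662170, 1590662169, 1590662169, 1590662171, 1590662172,
                  1590662185, 1590662185, 1590662188, 1590662186, 1590662189,
                  1590662192, 1590662192, 1590662204, 1590662204, 1590662204],
    "Signal": list("ABCDABDCDACDBCB"),
    "Group": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5],
})
clusters = [['A', 'B'], ['B', 'C'], ['A', 'D'], ['A', 'B', 'C'],
            ['B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]

# Sort once: within each group the first row is then the earliest
# Timestamp, with ties resolved by the lowest ID.
ordered = groups_df.sort_values(["Timestamp", "ID"])
# Set of signals per group (assumed equivalent to 'Signals_in_Group').
signal_sets = ordered.groupby("Group")["Signal"].apply(set)

columns = {}
for cluster in clusters:
    wanted = set(cluster)
    # Groups whose signal set contains the whole cluster.
    mask = signal_sets.map(wanted.issubset).astype(bool)
    ok_groups = signal_sets.index[mask.to_numpy()]
    rows = ordered[ordered["Group"].isin(ok_groups)
                   & ordered["Signal"].isin(cluster)]
    # Rows are pre-sorted, so the first row per group applies the rule.
    first_rows = rows.drop_duplicates("Group")
    columns[",".join(cluster)] = first_rows.set_index("Group")["Signal"]

# Groups that match no cluster never enter any column, so they drop out.
result = pd.DataFrame(columns).sort_index()
result.index.name = "Group"
print(result)
```

Whether this is fast enough on 763k rows has not been tested here; the number of Python-level iterations simply falls from (groups x clusters) to (clusters).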
Source: https://stackoverflow.com/questions/62069465/match-lists-and-get-value-of-one-column-based-on-values-of-other-columns-from-da