Match lists and get value of one column based on values of other columns from dataframe optimization

Question

I have an input dataframe like the one below, where 'ID' is a unique identifier, 'Signals_in_Group' is a derived field containing the list of all unique 'Signal' values present in a 'Group', and 'Signals_Count' is a derived field holding the number of items in 'Signals_in_Group'.

groups_df
    ID  Timestamp   Signal  Group   Signals_in_Group    Signals_Count
1   5   1590662170  A       1       [A, B, C]           3
2   2   1590662169  B       1       [A, B, C]           3
3   6   1590662169  C       1       [A, B, C]           3
4   8   1590662171  D       2       [A, D]              2
5   7   1590662172  A       2       [A, D]              2
6   10  1590662185  B       3       [A, B, C, D]        4
7   9   1590662185  D       3       [A, B, C, D]        4
8   3   1590662188  C       3       [A, B, C, D]        4
9   1   1590662186  D       3       [A, B, C, D]        4
10  11  1590662189  A       3       [A, B, C, D]        4
11  4   1590662192  C       4       [C, D]              2
12  12  1590662192  D       4       [C, D]              2
13  13  1590662204  B       5       [B, C]              2
14  14  1590662204  C       5       [B, C]              2
15  15  1590662204  B       5       [B, C]              2
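For anyone who wants to reproduce the problem, the table above can be rebuilt roughly as follows. This is only a sketch: the two derived columns are computed here the way they are described in the question (sorted unique 'Signal' values per 'Group', and their count), which is an assumption about how the original data was prepared. The name signals_per_group is just a helper introduced for illustration.

```python
import pandas as pd

# Raw columns copied from the table above: (ID, Timestamp, Signal, Group)
rows = [
    (5, 1590662170, "A", 1), (2, 1590662169, "B", 1), (6, 1590662169, "C", 1),
    (8, 1590662171, "D", 2), (7, 1590662172, "A", 2),
    (10, 1590662185, "B", 3), (9, 1590662185, "D", 3), (3, 1590662188, "C", 3),
    (1, 1590662186, "D", 3), (11, 1590662189, "A", 3),
    (4, 1590662192, "C", 4), (12, 1590662192, "D", 4),
    (13, 1590662204, "B", 5), (14, 1590662204, "C", 5), (15, 1590662204, "B", 5),
]
groups_df = pd.DataFrame(
    rows,
    columns=["ID", "Timestamp", "Signal", "Group"],
    index=range(1, len(rows) + 1),  # the table uses a 1-based index
)

# Assumed derivation: sorted unique signals per group (hypothetical helper dict)
signals_per_group = {
    group: sorted(set(signals))
    for group, signals in groups_df.groupby("Group")["Signal"]
}
groups_df["Signals_in_Group"] = groups_df["Group"].map(signals_per_group)
groups_df["Signals_Count"] = groups_df["Signals_in_Group"].map(len)

print(groups_df)
```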

Below is the second input, which is a list of lists:

clusters = [['A', 'B'], ['B', 'C'], ['A', 'D'], ['A', 'B', 'C'], ['B', 'C', 'D'], ['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]

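For each group, I need to find which of the clusters are contained in 'Signals_in_Group', i.e. every item of the cluster is present in the group. For each cluster matched in a group, I then need the first occurring 'Signal' by 'Timestamp' among the rows whose 'Signal' belongs to that cluster. If more than one row shares the earliest 'Timestamp', the 'Signal' of the row with the lowest 'ID' should be used.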

Example for Group 1: 'Signals_in_Group' ([A, B, C]) contains the clusters [A, B], [B, C] and [A, B, C].

For cluster [A, B] in 'Group' 1, the rows with index 1 and 2 match. Row 2 has the lower 'Timestamp', so its 'Signal' value 'B' becomes the output.

For cluster [B, C] in 'Group' 1, the rows with index 2 and 3 match. Both have the same timestamp, so take the lowest 'ID' among them, which is 2; the corresponding 'Signal' value 'B' becomes the output.

For cluster [A, B, C] in 'Group' 1, the rows with index 1, 2 and 3 match. Rows 2 and 3 share the lowest timestamp, so take the lowest 'ID' among them, which is 2; the corresponding 'Signal' value 'B' becomes the output.

The same should be done for all groups.
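The selection rule described above (earliest 'Timestamp', ties broken by lowest 'ID') can be sketched for a single group as below. The function name first_signal and the small group1 frame are made up for illustration; the frame just repeats the three Group 1 rows from the input table.

```python
import pandas as pd

# The three rows of Group 1, copied from the input table
group1 = pd.DataFrame(
    {
        "ID": [5, 2, 6],
        "Timestamp": [1590662170, 1590662169, 1590662169],
        "Signal": ["A", "B", "C"],
    },
    index=[1, 2, 3],
)


def first_signal(group_df, cluster):
    """Return the earliest signal of `cluster` in one group, or None if no match."""
    # A cluster matches only if every one of its signals occurs in the group
    if not set(cluster) <= set(group_df["Signal"]):
        return None
    matches = group_df[group_df["Signal"].isin(cluster)]
    # Earliest timestamp first, lowest ID first among equal timestamps
    return matches.sort_values(["Timestamp", "ID"]).iloc[0]["Signal"]


for cluster in (["A", "B"], ["B", "C"], ["A", "B", "C"], ["A", "D"]):
    print(cluster, first_signal(group1, cluster))
```

For Group 1 this gives 'B' for the three contained clusters and None for [A, D], which agrees with the walkthrough above.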

The output should look like the table below: each item in 'clusters' becomes a column name and each row is one 'Group'.

Group
        A,B     B,C     A,D     A,B,C   B,C,D   A,B,C,D     C,D,E,F
1       B       B       NaN     B       NaN     NaN         NaN
2       NaN     NaN     D       NaN     NaN     NaN         NaN
3       B       B       D       B       D       D           NaN
5       NaN     B       NaN     NaN     NaN     NaN         NaN

I achieved this with the code below, first iterating over the groups and then, for each group, iterating over the clusters. However, it takes too long to complete, so I'm looking for a more Pythonic and optimized solution to make it faster. I tested it on 763k rows with 52k groups and 200 clusters; it took around 4 hours.

Any suggestion to improve the runtime would be appreciated. Thanks.

import pandas as pd

# One result column per cluster (e.g. 'A,B'), indexed by 'Group'
cls = [','.join(ele) for ele in clusters]
cls.insert(0, 'Group')
result = pd.DataFrame(columns=cls)
result.set_index('Group', inplace=True)

groups = groups_df['Group'].unique()

for group in groups:

    # Get all records belonging to the group
    group_df = groups_df.groupby('Group').get_group(group)

    # Drop clusters that have more items than 'Signals_in_Group', since they cannot match
    clusters_fil = [x for x in clusters if len(x) <= group_df['Signals_Count'].iloc[0]]

    for cluster in clusters_fil:

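        # The cluster matches only if every one of its signals occurs in the group
        if all(elem in group_df['Signals_in_Group'].iloc[0] for elem in cluster):

            # Rows of the group whose signal belongs to the cluster
            cluster_df = group_df[group_df['Signal'].isin(cluster)]
            # Keep the rows with the earliest timestamp, then pick the lowest ID
            inter = cluster_df.loc[cluster_df['Timestamp'] == cluster_df['Timestamp'].min()]
            result.loc[group, ','.join(cluster)] = inter.loc[inter.ID == inter.ID.min(), 'Signal'].iat[0]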
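One possible direction for speeding this up, offered only as a sketch and not benchmarked on the 763k-row data: sort the frame once by 'Group', 'Timestamp' and 'ID', then loop over the clusters only (200 iterations) and let pandas handle all groups at once inside each iteration. The function name first_signal_per_cluster is made up for illustration. The sketch assumes that 'Signals_in_Group' is exactly the set of unique 'Signal' values of the group, as described in the question, so it works from the 'Signal' column directly.

```python
import pandas as pd


def first_signal_per_cluster(groups_df, clusters):
    """Sketch: one vectorized pass per cluster instead of a loop over groups."""
    # After this sort, the first row of any subset of a group is the one with
    # the earliest Timestamp and, among ties, the lowest ID
    ordered = groups_df.sort_values(["Group", "Timestamp", "ID"])
    columns = {}
    for cluster in clusters:
        members = set(cluster)
        # Rows (in every group) whose signal belongs to this cluster
        rows = ordered[ordered["Signal"].isin(members)]
        by_group = rows.groupby("Group")["Signal"]
        # A group contains the cluster only if all of its signals occur there
        complete = by_group.nunique() == len(members)
        # first() keeps the within-group row order established by the sort
        columns[",".join(cluster)] = by_group.first()[complete]
    # Series with different group indexes are aligned; missing cells become NaN
    return pd.DataFrame(columns).sort_index()
```

Called as first_signal_per_cluster(groups_df, clusters), this should produce one row per group that matches at least one cluster and one column per cluster, in the same layout as the expected output. Whether it is fast enough on the full data would need to be measured.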

Source: https://stackoverflow.com/questions/62069465/match-lists-and-get-value-of-one-column-based-on-values-of-other-columns-from-da

Tags

python

pandas

list

dataframe

optimization