How can I match all the key-value pairs in Python when my code runs too long?

挽巷 2020-12-22 03:23

User-item affinity and recommendations:
I am creating a table that implements a "customers who bought this item also bought" algorithm.
Input dataset

2 Answers
  • 2020-12-22 03:39

    Yes, the algorithm can be improved. You are recalculating the list of users for each item inside the loop multiple times. Instead, build a dictionary that maps each item to its users once, outside the loops.

    import itertools
    
    import pandas as pd
    
    # `main` is the DataFrame from the question (columns: userId, productId)
    
    # get unique items
    items = set(main.productId)
    
    n_users = len(set(main.userId))
    
    # make a dictionary of item and users who bought that item
    item_users = main.groupby('productId')['userId'].apply(set).to_dict()
    
    # iterate over combinations of item1 and item2 and store scores
    result = []
    for item1, item2 in itertools.combinations(items, 2):
    
      # share of all users who bought both items
      score = len(item_users[item1] & item_users[item2]) / n_users
      result.append((item1, item2, score))
      result.append((item2, item1, score)) # store score for reverse order as well
    
    # convert results to a dataframe
    result = pd.DataFrame(result, columns=["item1", "item2", "score"])
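To make the scoring concrete, here is a minimal end-to-end run of the same idea on a toy dataset. The purchase log below is hypothetical (the question's actual input is not reproduced here), and `toy_result` is just an illustrative name. Items are sorted only so the output order is deterministic.

```python
import itertools

import pandas as pd

# Hypothetical purchase log; NOT the dataset from the question.
main = pd.DataFrame({
    "userId":    ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "productId": ["A",  "B",  "A",  "B",  "C",  "A",  "C",  "B"],
})

n_users = main["userId"].nunique()

# Map each product to the set of users who bought it (computed once).
item_users = main.groupby("productId")["userId"].apply(set).to_dict()

rows = []
for item1, item2 in itertools.combinations(sorted(item_users), 2):
    # Share of all users who bought both products.
    score = len(item_users[item1] & item_users[item2]) / n_users
    rows.append((item1, item2, score))
    rows.append((item2, item1, score))  # same score in the reverse direction

toy_result = pd.DataFrame(rows, columns=["item1", "item2", "score"])
print(toy_result)
```

With this toy data, A and B share two of the four users, so their score is 0.5, while B and C share only one user and score 0.25.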
    

    Timing differences:

    Original implementation from question

    # 3 loops, best of 3: 41.8 ms per loop

    Mark's Method 2

    # 3 loops, best of 3: 19.9 ms per loop

    Implementation in this answer

    # 3 loops, best of 3: 3.01 ms per loop

  • 2020-12-22 03:42

    The key here is to create a Cartesian product of productId with itself. See the code below.

    Method 1 (works with smaller datasets)

    result=(main.drop_duplicates(['productId','userId'])
                .assign(cartesian_key=1)
                .pipe(lambda x:x.merge(x,on='cartesian_key'))
                .drop('cartesian_key',axis=1)
                .loc[lambda x:(x.productId_x!=x.productId_y) & (x.userId_x==x.userId_y)]
                .groupby(['productId_x','productId_y']).size()
                .div(main['userId'].nunique()))
    
    result
    
    Prod1   prod2   0.75
    Prod1   prod3   0.75
    Prod1   prod4   0.75
    Prod1   prod5   0.5
    prod2   Prod1   0.75
    prod2   prod3   0.5
    prod2   prod4   0.5
    prod2   prod5   0.25
    prod3   Prod1   0.75
    prod3   prod2   0.5
    prod3   prod4   0.5
    prod3   prod5   0.5
    prod4   Prod1   0.75
    prod4   prod2   0.5
    prod4   prod3   0.5
    prod4   prod5   0.5
    prod5   Prod1   0.5
    prod5   prod2   0.25
    prod5   prod3   0.5
    prod5   prod4   0.5
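# Method 1 builds the full row-by-row Cartesian product and only afterwards keeps
# the rows where both user IDs match. A possible variant, not part of the original
# answer, is to self-merge on userId directly, so that only products bought by the
# same user are ever paired. The sketch below assumes a small hypothetical `main`
# DataFrame with the same two columns; `pairs` and `co_bought` are illustrative names.

```python
import pandas as pd

# Hypothetical purchase log; NOT the dataset from the question.
main = pd.DataFrame({
    "userId":    ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "productId": ["A",  "B",  "A",  "B",  "C",  "A",  "C",  "B"],
})

# One row per (product, user), so repeat purchases are not double counted.
pairs = main.drop_duplicates(["productId", "userId"])

# Joining on userId pairs up only products bought by the same user.
co_bought = (
    pairs.merge(pairs, on="userId")
         .loc[lambda x: x.productId_x != x.productId_y]
         .groupby(["productId_x", "productId_y"]).size()
         .div(main["userId"].nunique())
)
print(co_bought)
```

# As in Method 1, product pairs that no user bought together simply do not appear
# in the result, rather than showing up with a score of 0.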
    
    

    Method 2

    result = (main.groupby(['productId','userId']).size()
                .clip(upper=1)
                .unstack()
                .assign(key=1)
                .reset_index()
                .pipe(lambda x:x.merge(x,on='key'))
                .drop('key',axis=1)
                .loc[lambda x:(x.productId_x!=x.productId_y)]
                .set_index(['productId_x','productId_y'])
                .pipe(lambda x:x.set_axis(x.columns.str.split('_',expand=True),axis=1))
                .swaplevel(axis=1)
                .pipe(lambda x:(x['x']+x['y']))
                .fillna(0)
                .div(2)
                .mean(axis=1))
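Method 2 effectively compares binary "user bought product" indicators for every product pair. A related formulation, offered here only as a sketch and not as the answerer's method, is to build that binary matrix with `pd.crosstab` and take a matrix product. The toy `main` DataFrame and the names `bought`, `co_counts` and `scores` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical purchase log; NOT the dataset from the question.
main = pd.DataFrame({
    "userId":    ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "productId": ["A",  "B",  "A",  "B",  "C",  "A",  "C",  "B"],
})

# Binary user x product matrix: 1 if the user bought the product at least once.
bought = pd.crosstab(main["userId"], main["productId"]).clip(upper=1)

# (product x user) times (user x product) counts the users shared by each pair.
co_counts = bought.T.dot(bought)

# Divide by the number of distinct users to get the co-purchase share.
scores = co_counts / bought.shape[0]
print(scores)
```

The off-diagonal entries are the pair scores. The diagonal holds each product's own purchase share and should be ignored when reading off recommendations.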
    