Python/Pandas: How to Match List of Strings with a DataFrame column

后端 未结 4 1827
离开以前
离开以前 2021-01-04 21:32

I want to compare two columnn -- Description and Employer. I want to see if any keywords in Employer are found in the Descriptio

相关标签:
4条回答
  • 2021-01-04 22:00

    Timing comparison for different solutions

    Let's prepare a bit bigger DF - 2.000 rows:

    In [3]: df = pd.concat([df] * 10**2, ignore_index=True)
    
    In [4]: df.shape
    Out[4]: (2000, 5)
    

    Solution 1: df.apply(..., axis=1):

    df["Text_Search"] = df.Employer.str.lower().str.split().map(set)
    
    In [15]: %%timeit
        ...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
        ...:               .apply(lambda r: (df["Text_Search"] & r.desc).any(),
        ...:                      axis=1)
        ...:
    1 loop, best of 3: 5.06 s per loop
    

    Solution 2: CountVectorizer

    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer(min_df=1, lowercase=True)
    
    In [8]: %%timeit
       ...: X = vect.fit_transform(df['Employer'])
       ...: cols_emp = vect.get_feature_names()
       ...: X = vect.fit_transform(df['Description'])
       ...: cols_desc = vect.get_feature_names()
       ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
       ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
       ...:
    10 loops, best of 3: 88.2 ms per loop
    

    Solution 3: df.apply(search_func, axis=1)

    df["Text_Search"] = df["Employer"].str.lower().str.split()
    
    In [12]: %%timeit
        ...: df.apply(search_func, axis=1)
        ...:
    1 loop, best of 3: 362 ms per loop
    

    NOTE: Solution 1 is too slow, so i will not "timeit" this solution for bigger DFs


    Comparison of df.apply(search_func, axis=1) and CountVectorizer for 20.000 rows DF:

    In [16]: df = pd.concat([df] * 10, ignore_index=True)
    
    In [17]: df.shape
    Out[17]: (20000, 6)
    
    In [20]: %%timeit
        ...: df.apply(search_func, axis=1)
        ...:
    1 loop, best of 3: 3.66 s per loop
    
    In [21]: %%timeit
        ...: X = vect.fit_transform(df['Employer'])
        ...: cols_emp = vect.get_feature_names()
        ...: X = vect.fit_transform(df['Description'])
        ...: cols_desc = vect.get_feature_names()
        ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
        ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
        ...:
    1 loop, best of 3: 825 ms per loop
    

    Comparison of df.apply(search_func, axis=1) and CountVectorizer for 200.000 rows DF:

    In [22]: df = pd.concat([df] * 10, ignore_index=True)
    
    In [23]: df.shape
    Out[23]: (200000, 6)
    
    In [24]: %%timeit
        ...: df.apply(search_func, axis=1)
        ...:
    1 loop, best of 3: 36.8 s per loop
    
    In [25]: %%timeit
        ...: X = vect.fit_transform(df['Employer'])
        ...: cols_emp = vect.get_feature_names()
        ...: X = vect.fit_transform(df['Description'])
        ...: cols_desc = vect.get_feature_names()
        ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
        ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
        ...:
    1 loop, best of 3: 8.28 s per loop
    

    Conclusion: CountVectorized solution is apporx. 4.44 times faster compared to df.apply(search_func, axis=1)

    0 讨论(0)
  • 2021-01-04 22:02

    Here is a readable solution using an individual search_func:

    def search_func(row):
        matches = [test_value in row["Description"].lower() 
                   for test_value in row["Text_Search"]]
    
        if any(matches):
            return "Yes"
        else:
            return "No"
    

    This function is then applied row-wise:

    # create example data
    df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
                       "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})
    
    print(df)
        Description             Employer
    0   CANSEL SURVEY E PAY     Cansel Survey Equipment
    1   JX154 TFR?FR xxx8690    Cansel Survey Equipment
    
    # create text searches and match column
    df["Text_Search"] = df["Employer"].str.lower().str.split()
    df["Match"] = df.apply(search_func, axis=1)
    
    # show result
    print(df)
        Description             Employer                    Text_Search                     Match
    0   CANSEL SURVEY E PAY     Cansel Survey Equipment     [cansel, survey, equipment]     Yes
    1   JX154 TFR?FR xxx8690    Cansel Survey Equipment     [cansel, survey, equipment]     No
    
    0 讨论(0)
  • 2021-01-04 22:11

    Here is fast and memory-saving vectrorized solution, which uses sklearn.feature_extraction.text.CountVectorizer method:

    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer(min_df=1, lowercase=True)
    
    X = vect.fit_transform(df['Employer'])
    cols_emp = vect.get_feature_names()
    
    X = vect.fit_transform(df['Description'])
    cols_desc = vect.get_feature_names()
    
    common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
    
    df['Match'] = (X.toarray()[:, common_cols_idx] == 1).any(1)
    

    Source DF:

    In [259]: df
    Out[259]:
              Date           Description   Amount  AutoNumber                 Employer
    0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment
    2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment
    5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment
    9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment
    10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment
    11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment
    12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
    13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment
    14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
    15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment
    17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment
    19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment
    20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment
    21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
    22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment
    23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment
    24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246             Cansel IR453
    26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment
    27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment
    28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246      Cansel Survey HU521
    

    Result:

    In [261]: df
    Out[261]:
              Date           Description   Amount  AutoNumber                 Employer  Match
    0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment  False
    2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment  False
    5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
    9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment  False
    10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment  False
    11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment   True
    12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment  False
    14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment   True
    17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment  False
    19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment  False
    20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment  False
    21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment  False
    23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
    24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246             Cansel IR453   True
    26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment  False
    27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment  False
    28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246      Cansel Survey HU521   True
    

    Some explanations:

    In [266]: cols_desc
    Out[266]:
    ['cansel',
     'fr',
     'hi540',
     'hq010',
     'hu204',
     'hu521',
     'ir453',
     'jh401',
     'jj500',
     'jx154',
     'pay',
     'rq574',
     'survey',
     'tfr',
     'ue200',
     'uo414',
     'uq263',
     'ut022',
     'ut460',
     'wu455',
     'ww120',
     'xxx8690']
    
    In [267]: cols_emp
    Out[267]: ['cansel', 'equipment', 'hu521', 'ir453', 'survey']
    
    In [268]: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
    
    In [269]: common_cols_idx
    Out[269]: [0, 5, 6, 12]
    
    In [270]: X.toarray()
    Out[270]:
    array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
           [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)
    
    In [271]: X.toarray()[:, common_cols_idx]
    Out[271]:
    array([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 1, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 1, 0, 0]], dtype=int64)
    
    In [272]: X.toarray()[:, common_cols_idx] == 1
    Out[272]:
    array([[False, False, False, False],
           [False, False, False, False],
           [ True, False, False,  True],
           [False, False, False, False],
           [False, False, False, False],
           [ True, False, False,  True],
           [False, False, False, False],
           [False, False, False, False],
           [False, False, False, False],
           [ True, False, False,  True],
           [False, False, False, False],
           [False, False, False, False],
           [False, False, False, False],
           [False, False, False, False],
           [False, False, False, False],
           [ True, False, False,  True],
           [False, False,  True, False],
           [False, False, False, False],
           [False, False, False, False],
           [False,  True, False, False]], dtype=bool)
    
    In [273]: (X.toarray()[:, common_cols_idx] == 1).any(1)
    Out[273]: array([False, False,  True, False, False,  True, False, False, False,  True, False, False, False, False, False,  True,  True, Fals
    e, False,  True], dtype=bool)
    
    0 讨论(0)
  • 2021-01-04 22:24

    Here is one solution, which splits text into lower-case sets and uses sets intersection for each row:

    In [160]: x['Match'] = x.Description.str.lower().str.split().map(set).to_frame('desc') \
         ...:               .apply(lambda r: (x.Employer.str.lower().str.split().map(set) & r.desc).any(),
         ...:                      axis=1)
         ...:
    
    In [161]: x
    Out[161]:
              Date           Description   Amount  AutoNumber                 Employer  Match
    0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment  False
    2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment  False
    5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
    9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment  False
    10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment  False
    11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment   True
    12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment  False
    14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment   True
    17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment  False
    19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment  False
    20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment  False
    21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
    22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment  False
    23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
    24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246  Cansel Survey Equipment  False
    26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment  False
    27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment  False
    28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246  Cansel Survey Equipment  False
    

    PS it's pretty slow as it's using not vectorized .apply(..., axis=1) method

    0 讨论(0)
提交回复
热议问题