How to parallelize many (fuzzy) string comparisons using apply in Pandas?

半阙折子戏 2020-12-02 10:18

I have the following problem.

I have a dataframe master that contains sentences, such as

master
Out[8]: 
                  original
0         


        
3 answers
  •  旧时难觅i
    2020-12-02 10:53

    I'm working on something similar and wanted to provide a more complete working solution for anyone else who might stumble upon this question. Unfortunately, the code snippets @MRocklin provided contain some syntax errors. I am no expert with Dask, so I can't comment on performance considerations, but this should accomplish your task just as @MRocklin suggested. This uses Dask version 0.17.2 and Pandas version 0.22.0:

    import dask.dataframe as dd
    import dask.multiprocessing
    from fuzzywuzzy import fuzz
    import pandas as pd
    
    master = pd.DataFrame({'original': ['this is a nice sentence',
                                        'this is another one',
                                        'stackoverflow is nice']})
    
    slave = pd.DataFrame({'name': ['hello world',
                                   'congratulations',
                                   'this is a nice sentence ',
                                   'this is another one',
                                   'stackoverflow is nice'],
                          'my_value': [1, 2, 3, 4, 5]})
    
    def fuzzy_score(str1, str2):
        return fuzz.token_set_ratio(str1, str2)
    
    def helper(orig_string, slave_df):
        # score every name in a local Series so the shared slave_df is not mutated
        scores = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
        # return my_value corresponding to the highest score
        return slave_df.loc[scores.idxmax(), 'my_value']
    
    dmaster = dd.from_pandas(master, npartitions=4)
    # meta gives the name and dtype of the result (my_value holds integers)
    dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('my_value', 'i8'))
    

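    Then compute the result, as in this interpreter session. The get= keyword matches the Dask version mentioned above; later Dask releases replaced it with scheduler=, for example compute(scheduler='processes'):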

    In [6]: dmaster.compute(get=dask.multiprocessing.get)                                             
    Out[6]:                                          
                      original  my_value             
    0  this is a nice sentence         3             
    1      this is another one         4             
    2    stackoverflow is nice         5    
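
For readers without Dask installed, the same idea (score every candidate name against one original string, keep the value with the highest score, and map that lookup over all originals) can be sketched with the standard library alone. This is only an illustration under stated assumptions: difflib.SequenceMatcher stands in for fuzzywuzzy's token_set_ratio and will give different scores on messier data, and the names similarity, best_value, match_all and match_all_in_processes are hypothetical helpers, not part of Dask, Pandas or fuzzywuzzy.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    # Assumption: difflib's ratio (0.0 to 1.0) is a stand-in scorer,
    # not the token_set_ratio used in the answer above.
    return SequenceMatcher(None, a, b).ratio()


def best_value(orig_string, names, values):
    # Score every candidate name against one original string and
    # return the value paired with the highest-scoring name.
    scores = [similarity(name, orig_string) for name in names]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return values[best_index]


def match_all(originals, names, values, map_fn=map):
    # map_fn can be the builtin map (serial) or an executor's map (parallel).
    lookup = partial(best_value, names=names, values=values)
    return list(map_fn(lookup, originals))


def match_all_in_processes(originals, names, values, max_workers=None):
    # Each original string is an independent task, so the lookups can be
    # spread over worker processes without sharing any state.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return match_all(originals, names, values, map_fn=executor.map)
```

Calling match_all(originals, names, values) runs serially, which is handy for checking results on a small sample. match_all_in_processes should be called from under an if __name__ == "__main__": guard, since worker processes may re-import the main module. For short strings and small candidate lists the process start-up cost can easily outweigh the gain, so it is worth timing both variants.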
    
