I have the following problem
I have a dataframe master that contains sentences, such as
master
Out[8]:
original
0
I'm working on something similar and I wanted to provide a more complete working solution for anyone else you might stumble upon this question. @MRocklin unfortunately has some syntax errors in the code snippets provided. I am no expert with Dask, so I can't comment on some performance considerations, but this should accomplish your task just as @MRocklin has suggested. This is using Dask version 0.17.2 and Pandas version 0.22.0:
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
from fuzzywuzzy import fuzz
import pandas as pd
master= pd.DataFrame({'original':['this is a nice sentence',
'this is another one',
'stackoverflow is nice']})
slave= pd.DataFrame({'name':['hello world',
'congratulations',
'this is a nice sentence ',
'this is another one',
'stackoverflow is nice'],'my_value': [1,2,3,4,5]})
def fuzzy_score(str1, str2):
return fuzz.token_set_ratio(str1, str2)
def helper(orig_string, slave_df):
slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x,orig_string))
#return my_value corresponding to the highest score
return slave_df.loc[slave_df.score.idxmax(),'my_value']
dmaster = dd.from_pandas(master, npartitions=4)
dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave),meta=('x','f8'))
Then, obtain your results (like in this interpreter session):
In [6]: dmaster.compute(get=dask.multiprocessing.get)
Out[6]:
original my_value
0 this is a nice sentence 3
1 this is another one 4
2 stackoverflow is nice 5