I have the following problem.
I have a dataframe master that contains sentences, such as:
master
Out[8]:
original
0
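You can parallelize this with dask.dataframe.
>>> import dask.dataframe as dd
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> # newer dask releases take meta=('my_value', 'int64') here instead of name=
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()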
original my_value
0 this is a nice sentence 2
1 this is another one 3
2 stackoverflow is nice 1
Additionally, you should think about the tradeoffs between threads and processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes means data has to be serialized and moved between processes on your machine, which might slow things down a bit.
You can experiment with threads, processes, or a distributed system by passing the get= keyword argument to the compute() method. (In newer dask releases get= has been replaced by scheduler=, e.g. compute(scheduler='processes').)
>>> import dask.multiprocessing
>>> import dask.threaded
>>> dmaster.compute(get=dask.threaded.get)  # this is the default for dask.dataframe
>>> dmaster.compute(get=dask.multiprocessing.get)  # try processes instead
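If you want to get a feel for the threads vs processes tradeoff without dask, here is a minimal standard-library sketch. The helper below is only a hypothetical stand-in for the one in the question (which isn't shown): it assumes slave is a list of reference strings and counts how many of them fuzzily match a word in the sentence, using difflib. The names apply_helper and threshold are also made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def helper(sentence, slave, threshold=0.8):
    # Hypothetical stand-in for the question's helper(): count the reference
    # strings in `slave` that closely match at least one word of the sentence.
    words = sentence.split()
    return sum(
        1
        for ref in slave
        if any(SequenceMatcher(None, w, ref).ratio() >= threshold for w in words)
    )


def apply_helper(sentences, slave, executor_cls, workers=4):
    # Map helper over the sentences with either a thread pool or a process pool.
    # With processes, the sentences and `slave` are pickled and sent to workers.
    work = partial(helper, slave=slave)
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(work, sentences))


if __name__ == "__main__":
    sentences = [
        "this is a nice sentence",
        "this is another one",
        "stackoverflow is nice",
    ]
    slave = ["nice", "sentence", "another"]  # assumed shape of the lookup data
    print(apply_helper(sentences, slave, ThreadPoolExecutor))
    print(apply_helper(sentences, slave, ProcessPoolExecutor))
```

Both calls give the same result; only the run time differs. Since difflib is pure Python and holds the GIL, the thread pool should be no faster than a plain loop, while the process pool pays the pickling cost up front and only wins once the per-sentence work is large enough. Timing both on your real data is the only way to know which side of that line you are on.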