How to parallelize many (fuzzy) string comparisons using apply in Pandas?

半阙折子戏 2020-12-02 10:18

I have the following problem.

I have a dataframe, master, that contains sentences, such as:

master
Out[8]: 
                  original
0         


        
3 Answers
  •  时光说笑
    2020-12-02 10:34

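    You can parallelize this with dask.dataframe.

    >>> import dask.dataframe as dd
    >>> dmaster = dd.from_pandas(master, npartitions=4)
    >>> # recent Dask releases take meta=(name, dtype) here; older ones used name='my_value'
    >>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('my_value', 'int64'))
    >>> dmaster.compute()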
                      original  my_value
    0  this is a nice sentence         2
    1      this is another one         3
    2    stackoverflow is nice         1
    

    Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes will cause data to serialize and move around your machine, which might slow things down a bit.
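
To make the tradeoff concrete without Dask, here is a minimal standard-library sketch. It is not the asker's code: `count_close_matches` is a hypothetical stand-in for `helper`, built on `difflib.SequenceMatcher`, and the 0.6 threshold, the `score_all` name, and the sample `candidates` list are all assumptions chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def count_close_matches(sentence, candidates, threshold=0.6):
    """Hypothetical stand-in for `helper`: count candidates similar to `sentence`."""
    return sum(
        SequenceMatcher(None, sentence, candidate).ratio() >= threshold
        for candidate in candidates
    )


def score_all(sentences, candidates, executor_cls=ThreadPoolExecutor, workers=4):
    """Score every sentence against `candidates` using the given executor class."""
    # partial() of a module-level function is picklable, which process pools require.
    score = partial(count_close_matches, candidates=candidates)
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `sentences`.
        return list(pool.map(score, sentences))


# Illustrative data only; the real `master` and `slave` frames are not shown here.
sentences = ["this is a nice sentence", "this is another one", "stackoverflow is nice"]
candidates = ["this is a nice sentence", "completely unrelated"]

# Threads: nothing is serialized, but the pure-Python matcher holds the GIL,
# so the comparisons still run one at a time.
thread_scores = score_all(sentences, candidates, ThreadPoolExecutor)
```

To try processes instead, call `score_all(sentences, candidates, ProcessPoolExecutor)` from under an `if __name__ == "__main__":` guard. The comparisons can then run on several cores, but the candidate list has to be pickled and shipped to the workers along with the tasks, which is the serialization cost described above.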

    You can experiment with threads, processes, or a distributed system by setting the scheduler= keyword argument of the compute() method (older Dask releases used get= for this).

    >>> dmaster.compute(scheduler='threads')  # this is the default for dask.dataframe
    >>> dmaster.compute(scheduler='processes')  # try processes instead
    
