Fuzzy string matching in Python

暗喜 2020-12-23 23:32

I have 2 lists of over a million names with slightly different naming conventions. The goal here is to match those records that are similar, using a threshold of 95% confidence.

3 answers
  •  谎友^ (original poster)
    2020-12-24 00:04

    Specific to fuzzywuzzy: note that process.extractOne currently defaults to WRatio as its scorer, which is by far the slowest of its algorithms, and the processor defaults to utils.full_process. If you pass in, say, fuzz.QRatio as your scorer it will run much quicker, though it is less powerful depending on what you're trying to match. It may be just fine for names, though. I personally have had good luck with token_set_ratio, which is at least somewhat quicker than WRatio. You can also run utils.full_process() on all your choices beforehand and then use fuzz.ratio as your scorer with processor=None to skip the processing step (see below). If you're just using the basic ratio function, fuzzywuzzy is probably overkill, though. (FWIW, I have a JavaScript port, fuzzball.js, where you can pre-calculate the token sets too and use those instead of recalculating each time.)

    This doesn't cut down the sheer number of comparisons, but it helps. (Possibly a BK-tree for this? I've been looking into the same situation myself.)
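
For the BK-tree idea, here is a rough pure-Python sketch of how such an index could prune comparisons. Nothing here comes from fuzzywuzzy: the `levenshtein` helper and the `BKTree` class are illustrative, and the tree works on a maximum edit distance, which only loosely corresponds to a 95% ratio cutoff, so treat it as a candidate filter rather than a drop-in replacement.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """BK-tree over a metric; each node is (word, {distance: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for name in ["john smith", "jon smith", "john smyth", "mary johnson"]:
    tree.add(name)

print(tree.search("john smith", 1))
```

The candidates that come back could then be scored with whichever fuzzywuzzy scorer you settled on, instead of scoring every pair.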

    Also be sure to have python-Levenshtein installed so you use the faster calculation.

    **The behavior below may change, open issues under discussion etc.**

    fuzz.ratio doesn't run full_process, and the token_set and token_sort functions accept a full_process=False param. If you don't set processor=None, the extract function will try to run full_process anyway. You can use functools' partial to, say, pass in fuzz.token_set_ratio with full_process=False as your scorer, and run utils.full_process on your choices beforehand.
