Is there a way to filter a django queryset based on string similarity (a la python difflib)?

陌清茗 2020-12-31 18:17

I have a need to match cold leads against a database of our clients.

The leads come from a third party provider in bulk (thousands of records) and sales is asking u

3 answers
  • 2020-12-31 18:38

    Soundex won't help you, because it's a phonetic algorithm. Joe and Joseph don't sound alike, so Soundex won't mark them as similar.
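
    To see this concretely, here is a minimal pure-Python sketch of classic American Soundex. The `soundex` helper is my own illustration, not any database's implementation, and other Soundex variants may treat edge cases differently. The two names end up with different codes:

```python
# Consonant groups of classic American Soundex; vowels, y, h and w get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code (illustrative sketch)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w do not separate two letters with the same code
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # a vowel resets prev, so a repeated code after a vowel is kept
        prev = code
    return (result + "000")[:4]


print(soundex("Joe"))     # J000
print(soundex("Joseph"))  # J210
```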

    You can try Levenshtein distance, which PostgreSQL implements (in its fuzzystrmatch extension). Your database may offer it too; if not, you should be able to write a stored procedure that calculates the distance between two strings and use it in your query.
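
    For reference, this is roughly the computation such a stored procedure would perform. It is a plain-Python sketch of the standard dynamic-programming algorithm with unit costs for insert, delete and substitute; the function name is my own, and it is not a drop-in replacement for a database function:

```python
def levenshtein(a, b):
    """Edit distance between a and b with unit costs (illustrative sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Joe", "Joseph"))  # 3
```

    Three insertions turn Joe into Joseph, so a filter that allows a distance of up to 3 would match this pair, at the price of also matching many unrelated short names.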

  • 2020-12-31 18:41

    This is possible with the trigram_similar lookup, available since Django 1.10 (it requires django.contrib.postgres and the pg_trgm extension). See the docs for PostgreSQL specific lookups and Full text search.
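
    To get a feel for what trigram similarity measures, here is a rough pure-Python approximation. The padding rule (two spaces before each word, one after) and the shared-over-total ratio follow my understanding of pg_trgm; the helper names are mine, and the extension's exact tokenization may differ, so treat the numbers as indicative only:

```python
import re


def trigrams(text):
    """Set of 3-character grams per word, padded as pg_trgm is assumed to pad."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(trigram_similarity("Joe", "Joseph"), 3))
print(trigram_similarity("Kafka", "kafka"))
```

    Under these assumptions Joe and Joseph share only 2 of 9 distinct trigrams. As far as I know the default pg_trgm similarity threshold is 0.3, so this pair would likely not match unless the threshold is lowered.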

  • 2020-12-31 18:42

    If you need to do this with Django and PostgreSQL and don't want to use the trigram-similarity lookup introduced in 1.10 (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity), you can implement it with Levenshtein distance like this:

    Extension needed: fuzzystrmatch

    You need to add the PostgreSQL extension to your database in psql:

    CREATE EXTENSION fuzzystrmatch;
    

    Let's define a custom function with which we can annotate a queryset. Besides the field expression, it takes just one argument, the search_term, and uses the PostgreSQL levenshtein function (see docs):

    from django.db.models import Func, IntegerField, Value
    
    class Levenshtein(Func):
        # Renders as levenshtein(<expression>, <search_term>)
        function = "levenshtein"
        output_field = IntegerField()
    
        def __init__(self, expression, search_term, **extras):
            # Wrap the term in Value() so it is sent as a query parameter
            # instead of being interpolated into the SQL string.
            super().__init__(expression, Value(search_term), **extras)
    

    Then, anywhere else in the project, we just import the Levenshtein function defined above, plus F to pass the Django field:

    from django.db.models import F
    
    Spot.objects.annotate(
        lev_dist=Levenshtein(F('name'), 'Kfaka')
    ).filter(
        lev_dist__lte=2
    )
    