Is there a way to measure string similarity in Google BigQuery

礼貌的吻别 2020-12-03 15:35

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a neat function to have.

My case is I need to compare

7 answers
  •  鱼传尺愫
    2020-12-03 16:12

    While I was looking at Felipe's answer above, I worked on my own query and ended up with two versions: one I call string approximation and the other string resemblance.

    The first looks at the shortest distance between the letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.

    #standardSql
    CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
    (select avg(best_result) from (
                                  select if(length(testString)
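
    For a baseline to compare against, a Levenshtein-based score can be computed directly. This is a minimal sketch, assuming the built-in EDIT_DISTANCE function is available in your BigQuery project; it is not part of the functions in this answer, and the column name levenshtein_similarity is made up for illustration. It normalises by the longer string so that 1 means identical:

    ```sql
    #standardSQL
    -- Assumption: EDIT_DISTANCE (Levenshtein distance) is available in this project.
    with test_subjects as (
      select 'benji' as name union all
      select 'benjamin' union all
      select 'ben artis'
    )
    select name,
           -- 1 = identical strings, 0 = every character of the longer string differs
           1 - safe_divide(edit_distance('benjamin artis', name),
                           greatest(length('benjamin artis'), length(name))) as levenshtein_similarity
    from test_subjects
    order by levenshtein_similarity desc
    ```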

    The second is a variation of the first: it looks at sequences of matching distances, so a character that matches at the same distance as the character preceding or following it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).

    #standardSQL
    CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
    (
    select avg(sequence)
    from (
          select ref,
                 if(array_length(array(select * from comparison.collection intersect distinct
                                       (select * from comparison.before))) > 0
                        or array_length(array(select * from comparison.collection intersect distinct
                                              (select * from comparison.after))) > 0
                     , 1, 0) as sequence
    
          from (
                   select ref,
                          collection,
                          lag(collection) over (order by ref)  as before,
                          lead(collection) over (order by ref) as after
                   from (
                         select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
                                array_agg(result ignore nulls)                                          as collection
                         from (
                                  select *,
                                         if(source = test, abs(sourceoffset - (testoffset)), null) as result
                                  from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                           cross join
                                       (select *
                                        from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                  ) as results
                         group by ref
                            )
                   ) as comparison
          )
    
    )
    );
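
    To see what the function compares, the innermost aggregation can be run on its own. This is only a sketch, with two literal inputs substituted for sourceString and testString. For each character position of the longer string, it lists the distances at which the same character occurs in the other string:

    ```sql
    #standardSQL
    -- Illustrative inputs only: 'benjamin artis' stands in for sourceString,
    -- 'ben artis' for testString.
    select if(length('ben artis') < length('benjamin artis'), sourceoffset, testoffset) as ref,
           array_agg(result ignore nulls) as collection
    from (
          select sourceoffset,
                 testoffset,
                 -- distance between positions when the two characters match, else null
                 if(source = test, abs(sourceoffset - testoffset), null) as result
          from unnest(split(lower('benjamin artis'), '')) as source with offset as sourceoffset
               cross join
               unnest(split(lower('ben artis'), '')) as test with offset as testoffset
         )
    group by ref
    order by ref
    ```

    A position scores 1 when its collection shares at least one distance with the collection of the previous or next position, and the function returns the average of those scores.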
    

    Here is a sample of the results:

    #standardSQL
    with test_subjects as (
      select 'benji' as name union all
      select 'benjamin' union all
      select 'benjamin alan artis' union all
      select 'ben artis' union all
      select 'artis benjamin' 
    )
    
    select name,
           `myproject.func.stringApproximation`('benjamin artis', name) as approximation,
           `myproject.func.stringResemblance`('benjamin artis', name) as resemblance
    from test_subjects
    
    order by resemblance desc
    

    This returns

    +---------------------+---------------------+---------------------+
    | name                | approximation       | resemblance         |
    +---------------------+---------------------+---------------------+
    | artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
    +---------------------+---------------------+---------------------+
    | benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
    +---------------------+---------------------+---------------------+
    | ben artis           | 0.4142857142857142  | 0.7142857142857143  |
    +---------------------+---------------------+---------------------+
    | benjamin            | 0.6125850340136053  | 0.5714285714285714  |
    +---------------------+---------------------+---------------------+
    | benji               | 0.36269841269841263 | 0.28571428571428575 |
    +---------------------+---------------------+---------------------+
    

    Edited: updated the resemblance algorithm to improve results.
