Matching partial words in two different columns

陌清茗 2021-01-24 01:29

I am trying to weed out a certain customer from our database. I've noticed a trend where people fill out their first name with the same name that is partial to how t

3 answers
  •  既然无缘
    2021-01-24 01:35

    You can use a similarity-based approach.
    Try the code at the bottom of this answer.
    It produces a result like the one below:

    business_name           partial_business_name   first_name  similarity   
    locksmith taylorsville  locksmith               locksmith   1.0  
    locksmith farmington    locksmith               locksmith   1.0  
    locksmith roy           locksmith               locksmi     0.7777777777777778   
    locksmith clinton       locksmith               locks       0.5555555555555556   
    

    So you can control which rows to filter out based on the similarity value.
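
    As a rough illustration of that filtering step, the sketch below applies a cutoff to the sample rows shown above. The `THRESHOLD` value of 0.7 and the `flagSuspicious` helper are assumptions made for this example, not part of the original answer; pick a cutoff that suits your own data.

```javascript
// Rows copied from the sample result above.
const rows = [
  { business_name: "locksmith taylorsville", first_name: "locksmith", similarity: 1.0 },
  { business_name: "locksmith farmington", first_name: "locksmith", similarity: 1.0 },
  { business_name: "locksmith roy", first_name: "locksmi", similarity: 0.7777777777777778 },
  { business_name: "locksmith clinton", first_name: "locks", similarity: 0.5555555555555556 },
];

// Hypothetical cutoff (an assumption for this sketch); tune it against real data.
const THRESHOLD = 0.7;

// Keep only the rows whose first name is at least `threshold` similar
// to the first word of the business name.
function flagSuspicious(rows, threshold) {
  return rows.filter((row) => row.similarity >= threshold);
}

const flagged = flagSuspicious(rows, THRESHOLD);
console.log(flagged.map((row) => row.business_name));
```

    With this cutoff the "locksmith clinton" row (first name "locks") is not flagged, while the other three are. Lowering the threshold catches shorter prefixes at the cost of more false positives.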

    ** Code **

    SELECT business_name, partial_business_name, first_name, similarity FROM 
    JS( // input table
    (
      SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM 
        (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
        (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
        (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
        (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name),
    ) ,
    // input columns
    business_name, partial_business_name, first_name,
    // output schema
    "[{name: 'business_name', type:'string'},
      {name: 'partial_business_name', type:'string'},
      {name: 'first_name', type:'string'},
      {name: 'similarity', type:'float'}]
    ",
    // function
    "function(r, emit) {
    
      var _extend = function(dst) {
        var sources = Array.prototype.slice.call(arguments, 1);
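        // NOTE: the rest of _extend and the start of Levenshtein.get were lost
        // in extraction. They are restored here as the standard two-row
        // Levenshtein routine, so treat this span as an assumption.
        for (var i=0; i<sources.length; ++i) {
          var src = sources[i];
          for (var p in src) {
            if (src.hasOwnProperty(p)) dst[p] = src[p];
          }
        }
        return dst;
      };
    
      var Levenshtein = {
        // Calculate the Levenshtein distance between two strings.
        get: function(str1, str2) {
          // base cases
          if (str1 === str2) return 0;
          if (str1.length === 0) return str2.length;
          if (str2.length === 0) return str1.length;
    
          // two rows
          var prevRow = new Array(str2.length + 1),
              curCol, nextCol, i, j, tmp;
    
          // initialise previous row
          for (i=0; i<prevRow.length; ++i) {
            prevRow[i] = i;
          }
    
          // calculate current row distance from previous row
          for (i=0; i<str1.length; ++i) {
            nextCol = i + 1;
    
            for (j=0; j<str2.length; ++j) {
              curCol = nextCol;
    
              // substitution
              nextCol = prevRow[j] + ((str1.charAt(i) === str2.charAt(j)) ? 0 : 1);
              // insertion
              tmp = curCol + 1;
              if (nextCol > tmp) {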
                nextCol = tmp;
              }
              // deletion
              tmp = prevRow[j + 1] + 1;
              if (nextCol > tmp) {
                nextCol = tmp;
              }
    
              // copy current col value into previous (in preparation for next iteration)
              prevRow[j] = curCol;
            }
    
            // copy last col value into previous (in preparation for next iteration)
            prevRow[j] = nextCol;
          }
    
          return nextCol;
        }
    
      };
    
      // declare both variables so the_first_name is not an implicit global
      var the_partial_business_name, the_first_name;
    
      try {
        the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
      } catch (ex) {
        the_partial_business_name = r.partial_business_name.toLowerCase();
      }
    
      try {
        the_first_name = decodeURI(r.first_name).toLowerCase();
      } catch (ex) {
        the_first_name = r.first_name.toLowerCase();
      }
    
      emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
            similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});
    
      }"
    )
    ORDER BY similarity DESC
    

    This approach was also used in "How to perform trigram operations in Google BigQuery?" and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark, where Levenshtein distance is used to measure similarity.
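
    To check the similarity values outside BigQuery, the calculation can be sketched as a standalone script. This is a minimal sketch, not the code from pataky.js: the function names `levenshtein` and `similarity` are chosen for this example, and the normalization (1 minus the distance divided by the length of the first word of the business name) follows the `emit` call in the query above.

```javascript
// Two-row Levenshtein distance: the number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] holds the distance between the first i characters of a
  // and the first j characters of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 0; i < a.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < b.length; j++) {
      const substitution = prev[j] + (a[i] === b[j] ? 0 : 1);
      const insertion = cur[j] + 1;
      const deletion = prev[j + 1] + 1;
      cur.push(Math.min(substitution, insertion, deletion));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Same normalization as the query: 1 - distance / length of the reference string.
function similarity(reference, candidate) {
  return 1 - levenshtein(reference, candidate) / reference.length;
}

console.log(similarity("locksmith", "locksmith"));
console.log(similarity("locksmith", "locksmi"));
console.log(similarity("locksmith", "locks"));
```

    Because the distance is divided by the length of the reference string only, the score drops below zero when the first name is much longer than the first word of the business name, and it is undefined for an empty reference. Both cases may need handling before a threshold is applied.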
