How to perform trigram operations in Google BigQuery?

谁说胖子不能爱 提交于 2019-12-04 04:44:18

问题


I do use the pg_trgm module in PostgreSQL to calculate similarity between two strings using trigrams. Particularly I use:

similarity(text, text)

Which returns returns a number that indicates how similar the two arguments are (between 0 and 1).

How can I perform similarity function (or equivalent) on Google BigQuery?


回答1:


Try below. At least as a blueprint for enhancing

SELECT text1, text2, similarity FROM 
JS(
// input table
(
  SELECT * FROM 
  (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
  (SELECT 'mikhail' AS text1, 'mike' AS text2),
  (SELECT 'mikhail' AS text1, 'michael' AS text2),
  (SELECT 'mikhail' AS text1, 'javier' AS text2),
  (SELECT 'mikhail' AS text1, 'thomas' AS text2)
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_text1;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

  emit({text1: the_text1, text2: the_text2,
        similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});

  }"
)
ORDER BY similarity DESC

This is light modification based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark



来源:https://stackoverflow.com/questions/34815207/how-to-perform-trigram-operations-in-google-bigquery

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!