I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
Seems like would be a neat function to have.
My case is i need to compare the similarity of two urls as want to be fairly sure they refer to the same article.
I can find examples using javascript so maybe a UDF is the way to go but i've not used UDF's at all (or javascript for that matter :) )
Just wondering if there may be a way using existing regex functions or if anyone might be able to get me started with porting the javascript example into a UDF.
Any help much appreciated, thanks
EDIT: Adding some example code
So if i have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
Any i try test with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function
It seems like maybe row.inputA is not a string perhaps or for some reason string functions not able to work on it. Not sure if this is a type issue or something funny about what utils the UDF is able to use by default.
Again any help much appreciated, thanks.
Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
Levenshtein via JS would be the way to go. You can use the algorithm to get absolute string distance, or convert it to a percentage similarity by simply calculating abs(strlen - distance / strlen).
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.
Steps:
- Download the javascript version of fuzzywuzzy (fuzzball)
- Take the compiled file of the library: dist/fuzzball.umd.min.js and rename it to a clearer name (like
fuzzball
) - Upload it to a google cloud storage bucket
- Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
I couldn't find a direct answer to this, so I propose this solution, in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.
Below is quite simpler version for Hamming Distance by using WITH OFFSET
instead of ROW_NUMBER() OVER()
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box. One Flookup function you might find useful is this:
FUZZYMATCH (string1, string2)
Parameter Details
- string1: compares to string2.
- string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets so you feedback would be very welcome.
Edit: I'm the creator of Flookup.
来源:https://stackoverflow.com/questions/33433925/is-there-a-way-to-measure-string-similarity-in-google-bigquery