Question
If I have two strings in MySQL:
@a = "Welcome to Stack Overflow"; @b = "Hello to stack overflow";
is there a way to get the similarity percentage between those two strings using MySQL?
Here, for example, 3 words are shared (ignoring case), so the similarity should be something like:
count(words common to @a and @b) / (count(@a) + count(@b) - count(intersection))
and thus the result is 3 / (4 + 4 - 3) = 0.6
Any idea is highly appreciated!
Answer 1:
you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):
CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv1 and cv0 hold the previous and current rows of the distance matrix,
    -- one byte per cell, so inputs longer than 255 characters are not supported
    DECLARE cv0, cv1 text;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        -- first row: distance from the empty prefix of s1 to each prefix of s2
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                -- candidate 1: cell to the left + 1
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0;
                ELSE
                    SET cost = 1;
                END IF;
                -- candidate 2: diagonal cell + substitution cost
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                -- candidate 3: cell above + 1
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    RETURN c;
END
To get the result as a percentage (XX%), use this function:
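CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- CHAR_LENGTH (not LENGTH) so multi-byte characters count once, as in levenshtein()
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
        SET max_len = s1_len;
    ELSE
        SET max_len = s2_len;
    END IF;
    -- two empty strings are identical; avoid dividing by zero
    IF max_len = 0 THEN
        RETURN 100;
    END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END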
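A minimal usage sketch, assuming both functions from this answer have already been created in the current schema. If you paste the definitions into the mysql command-line client, you will probably need to switch the statement delimiter first (`DELIMITER //`, end each definition with `END//`, then `DELIMITER ;`), otherwise the semicolons inside the body end the statement early. Keep in mind that this is a character-level edit distance, not the word-overlap ratio from the question, so the percentage need not match the 0.6 computed there; whether case differences count as edits depends on the collation in use.

```sql
-- Assumes levenshtein() and levenshtein_ratio() from this answer already exist.
SET @a = 'Welcome to Stack Overflow';
SET @b = 'Hello to stack overflow';

SELECT levenshtein(@a, @b)       AS edit_distance,   -- number of single-character edits
       levenshtein_ratio(@a, @b) AS similarity_pct;  -- 0..100, relative to the longer string
```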
Answer 2:
I don't think there's a nice, single-step query way to do this - the natural language stuff is designed mostly for "google-like" search, which sounds different to what you're trying to do.
Depending on what you're actually trying to do - I assume you've left out a lot of detail - I would:
create a table into which you split each string into words, all in lower case, stripping out spaces and punctuation - in your example, you'd end up with:
string_id  word
1          hello
1          to
1          stack
1          overflow
2          welcome
2          to
2          stack
2          overflow
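As a rough sketch, the table could be set up like this. The name stringWords matches the query used in this answer; the column types and the primary key are my assumptions, not something the answer specifies. The rows are the lower-cased words of the question's two strings.

```sql
-- Hypothetical schema: one row per (string, word) pair.
-- The composite primary key keeps each word at most once per string.
CREATE TABLE stringWords (
    string_id INT         NOT NULL,
    word      VARCHAR(64) NOT NULL,
    PRIMARY KEY (string_id, word)
);

INSERT INTO stringWords (string_id, word) VALUES
    (1, 'hello'),   (1, 'to'), (1, 'stack'), (1, 'overflow'),
    (2, 'welcome'), (2, 'to'), (2, 'stack'), (2, 'overflow');
```

With this key, a string that repeats a word would need `INSERT IGNORE` (or de-duplication before inserting), since the formula in the question counts distinct words.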
You can then run queries against that table - e.g.
select count(*)
from stringWords
where string_id = 2
and word in
(select word
from stringWords
where string_id = 1);
gives you the intersection.
You can then create a function or similar to calculate similarity according to your formula.
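For instance, the formula from the question could be computed in a single query along these lines. This is only a sketch: it assumes a stringWords(string_id, word) table holding the lower-cased words of each string, and compares strings 1 and 2.

```sql
-- similarity = |common words| / (|words in 1| + |words in 2| - |common words|)
SELECT i.common / (a.total + b.total - i.common) AS similarity
FROM
    (SELECT COUNT(DISTINCT word) AS total
       FROM stringWords
      WHERE string_id = 1) AS a
CROSS JOIN
    (SELECT COUNT(DISTINCT word) AS total
       FROM stringWords
      WHERE string_id = 2) AS b
CROSS JOIN
    (SELECT COUNT(DISTINCT w1.word) AS common
       FROM stringWords AS w1
       JOIN stringWords AS w2 ON w2.word = w1.word
      WHERE w1.string_id = 1
        AND w2.string_id = 2) AS i;
```

For the two strings in the question (4 words each, 3 shared) this works out to 3 / (4 + 4 - 3) = 0.6. If both strings have no words at all, the denominator is zero and MySQL returns NULL rather than a number.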
Not very clean, but it should perform pretty snappily, it's mostly relational, and it should be largely language independent. To deal with possible typos, you could calculate the soundex - this would allow you to compare "stack" with "stak" and see how similar they really are, though this doesn't work reliably for languages other than English.
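A quick way to try the soundex idea, using MySQL's built-in SOUNDEX() function and the SOUNDS LIKE operator (which is shorthand for comparing the two SOUNDEX() values). Note that this gives a yes/no match on the phonetic code rather than a graded similarity score.

```sql
-- SOUNDEX() returns a phonetic code; SOUNDS LIKE compares the codes of both sides.
SELECT SOUNDEX('stack')           AS code_stack,
       SOUNDEX('stak')            AS code_stak,
       'stack' SOUNDS LIKE 'stak' AS same_sound;
```

For this pair the two codes should come out identical, so same_sound should be 1. In the word table approach you could store the code next to each word and join on it instead of on the word itself.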
Answer 3:
You can try the SOUNDEX algorithm, take a look here :)
SOUNDEX MySQL
EDIT 1:
Maybe this link about natural language processing with MySQL could be useful
Natural Language Full-Text Searches
How to find similar results and sort by similarity?
HTH!
Answer 4:
This might help if you do not want to write your own algorithm:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
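To give an idea of what that looks like, here is a small sketch of a natural language full-text search. The table name phrases and its columns are made up for illustration, and a FULLTEXT index on an InnoDB table assumes MySQL 5.6 or later (older versions need MyISAM).

```sql
-- Hypothetical table with a FULLTEXT index on the text column.
CREATE TABLE phrases (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;

INSERT INTO phrases (body) VALUES
    ('Welcome to Stack Overflow'),
    ('Hello to stack overflow'),
    ('Something unrelated');

-- Rank rows by relevance to the search phrase.
SELECT id, body,
       MATCH(body) AGAINST('Welcome to Stack Overflow' IN NATURAL LANGUAGE MODE) AS score
FROM phrases
ORDER BY score DESC;
```

The score is a relevance value, not a percentage, and very short or very common words (such as "to") may be ignored as stopwords, so this ranks candidates by similarity rather than computing the exact ratio from the question.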
Source: https://stackoverflow.com/questions/5322917/how-to-compute-similarity-between-two-strings-in-mysql