How to compute similarity between two strings in MySQL

Posted by 蓝咒 on 2019-11-26 03:09:55

Question


If I have two strings in MySQL:

@a="Welcome to Stack Overflow"
@b=" Hello to stack overflow";

is there a way to get the similarity percentage between those two strings using MySQL? Here, for example, 3 words are shared (ignoring case), so the similarity should be something like:
count(similar words between @a and @b) / (count(@a) + count(@b) - count(intersection))
and thus the result is 3 / (4 + 4 - 3) = 0.6.
Any idea is highly appreciated!


Answer 1:


you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv0 and cv1 each hold one row of the distance matrix, one byte per cell,
    -- so inputs longer than 255 characters are not supported.
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- Seed the previous row with 0..s2_len.
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          -- Start from the cell to the left plus one (insertion).
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- Diagonal cell plus the substitution cost.
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
          -- Cell above plus one (deletion).
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN SET c = c_temp; END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END

To get the result as a percentage, use this function:

CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- Use CHAR_LENGTH (characters, not bytes) so the lengths match what levenshtein() counts.
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
      SET max_len = s1_len;
    ELSE
      SET max_len = s2_len;
    END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
  END
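
A quick way to try the two functions, as a minimal sketch. It assumes `levenshtein` and `levenshtein_ratio` have already been created as shown above (in the mysql command-line client that usually means wrapping each CREATE FUNCTION in a DELIMITER change). Keep in mind that this measures character-level edit distance, not the word-overlap formula from the question, and that characters are compared under whatever collation is in effect, so with a case-insensitive collation 'Stack' and 'stack' count as equal.

```sql
-- Classic textbook pair: turning 'kitten' into 'sitting' takes 3 single-character edits.
SELECT levenshtein('kitten', 'sitting')       AS distance,  -- 3
       levenshtein_ratio('kitten', 'sitting') AS percent;   -- ROUND((1 - 3/7) * 100) = 57

-- The strings from the question (leading space of @b dropped for this sketch).
SET @a = 'Welcome to Stack Overflow';
SET @b = 'Hello to stack overflow';
SELECT levenshtein(@a, @b)       AS distance,
       levenshtein_ratio(@a, @b) AS percent;
```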



Answer 2:


I don't think there's a nice, single-step query way to do this - the natural language stuff is designed mostly for "google-like" search, which sounds different to what you're trying to do.

Depending on what you're actually trying to do - I assume you've left out a lot of detail - I would:

  • create a table into which you split each string into words, all in lower case, stripping out spaces and punctuation - in your example, you'd end up with:

    string_id               word

    1                       hello
    1                       to
    1                       stack
    1                       overflow
    2                       welcome
    2                       to
    2                       stack
    2                       overflow

You can then run queries against that table - e.g.

select count(*)
from  stringWords
where string_id = 2
and word in 
  (select word 
  from stringWords
  where string_id = 1);

gives you the intersection.

You can then create a function or similar to calculate similarity according to your formula.
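
As a rough sketch of that last step, the whole formula from the question can be computed in one query. The table definition and sample rows below are assumptions made for illustration (the answer only names a `stringWords` table with `string_id` and `word`); the words are taken from the two strings in the question.

```sql
-- Hypothetical layout for the word table: one row per distinct word per string.
CREATE TABLE stringWords (
  string_id INT         NOT NULL,
  word      VARCHAR(64) NOT NULL,
  PRIMARY KEY (string_id, word)
);

INSERT INTO stringWords (string_id, word) VALUES
  (1, 'hello'),   (1, 'to'), (1, 'stack'), (1, 'overflow'),
  (2, 'welcome'), (2, 'to'), (2, 'stack'), (2, 'overflow');

-- similarity = |intersection| / (|words of 1| + |words of 2| - |intersection|)
SELECT i.common / (a.cnt + b.cnt - i.common) AS similarity
FROM (SELECT COUNT(DISTINCT word) AS cnt
        FROM stringWords WHERE string_id = 1) AS a,
     (SELECT COUNT(DISTINCT word) AS cnt
        FROM stringWords WHERE string_id = 2) AS b,
     (SELECT COUNT(DISTINCT w1.word) AS common
        FROM stringWords AS w1
        JOIN stringWords AS w2 ON w2.word = w1.word
       WHERE w1.string_id = 1 AND w2.string_id = 2) AS i;
-- For the sample rows: 3 / (4 + 4 - 3) = 0.6000
```

If both strings have no words at all the denominator is zero and MySQL returns NULL, which a wrapping function would need to handle.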

Not very clean, but it should perform pretty snappily, it's mostly relational, and it should be largely language independent. To deal with possible typos, you could calculate the soundex - this would allow you to compare "stack" with "stak" and see how similar they really are, though this doesn't work reliably for languages other than English.
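
To illustrate the soundex idea, here is a small sketch using MySQL's built-in SOUNDEX() function and the SOUNDS LIKE operator, which is shorthand for comparing the two SOUNDEX() values. It only tells you whether two words share a phonetic code, not how far apart they are.

```sql
-- 'stack' and the typo 'stak' reduce to the same phonetic code,
-- so same_sound comes back as 1 (true).
SELECT SOUNDEX('stack')            AS code_a,
       SOUNDEX('stak')             AS code_b,
       'stack' SOUNDS LIKE 'stak'  AS same_sound;
```

In the word-table approach, one option would be to store SOUNDEX(word) in an extra column and intersect on that instead of on the word itself.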




Answer 3:


You can try the SOUNDEX algorithm, take a look here :)

SOUNDEX MySQL

EDIT 1:

Maybe this link about natural language processing with MySQL could be useful

Natural Language Full-Text Searches

How to find similar results and sort by similarity?

HTH!




Answer 4:


This might be of help to you if you do not want to write your own algorithms:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
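
For a feel of what the full-text route looks like, here is a minimal sketch. The `phrases` table and its rows are hypothetical, and it assumes an InnoDB FULLTEXT index (available from MySQL 5.6; the 5.0 manual linked above covers MyISAM only). The value returned by MATCH ... AGAINST is a relevance score for ranking rows, not a percentage, and short or very common words such as "to" are typically ignored by the index.

```sql
-- Hypothetical table with a full-text index on the text column.
CREATE TABLE phrases (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  body TEXT,
  FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;

INSERT INTO phrases (body) VALUES
  ('Welcome to Stack Overflow'),
  ('Hello to stack overflow'),
  ('Completely unrelated sentence');

-- Rank all rows by relevance to the search phrase; higher score = closer match.
SELECT id, body,
       MATCH(body) AGAINST('Hello to stack overflow' IN NATURAL LANGUAGE MODE) AS score
FROM phrases
ORDER BY score DESC;
```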



Source: https://stackoverflow.com/questions/5322917/how-to-compute-similarity-between-two-strings-in-mysql
