How to compute similarity between two strings in MySQL

Posted by 假如想象 on 2019-11-26 12:59:35
Alaa

you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

DELIMITER $$

CREATE FUNCTION `levenshtein`( s1 VARCHAR(255), s2 VARCHAR(255) ) RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv1 is the previous row of the distance matrix, cv0 the row being built.
    -- Each cell is stored as a single byte, hence the 255-character input limit.
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- First row: distance from the empty string to each prefix of s2.
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          -- Insertion: cell to the left + 1.
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- Substitution (or match): diagonal cell + cost.
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          -- Deletion: cell above + 1.
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
END$$

DELIMITER ;

And to get the result as a percentage (XX%), use this function:

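DELIMITER $$

CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- CHAR_LENGTH counts characters; LENGTH counts bytes and would skew
    -- the ratio for multi-byte text.
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
      SET max_len = s1_len;
    ELSE
      SET max_len = s2_len;
    END IF;
    -- Two empty strings are identical; this also avoids dividing by zero.
    IF max_len = 0 THEN
      RETURN 100;
    END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END$$

DELIMITER ;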
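
As a quick sanity check, the two functions could be called like this once both have been created. The `products` table and its `name` column in the second query are hypothetical, made up only to show how you might rank rows by distance.

```sql
-- Classic example: turning 'kitten' into 'sitting' takes 3 edits,
-- so the ratio is ROUND((1 - 3/7) * 100) = 57.
SELECT levenshtein('kitten', 'sitting')       AS distance,
       levenshtein_ratio('kitten', 'sitting') AS similarity_pct;

-- Hypothetical table: rank product names by closeness to a search term.
SELECT name, levenshtein(name, 'stak') AS distance
FROM products
ORDER BY distance
LIMIT 10;
```

Keep in mind that a stored function like this runs once per row and cannot use an index, so the second query is a full table scan. It is probably fine for a few thousand rows, less so for millions.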

I don't think there's a nice, single-step query way to do this - the natural language stuff is designed mostly for "google-like" search, which sounds different to what you're trying to do.

Depending on what you're actually trying to do - I assume you've left out a lot of detail - I would:

  • create a table into which you split each string into words, all in lower case, stripping out spaces and punctuation - in your example, you'd end up with:

    string_id   word
    ---------   --------
    1           hello
    1           from
    1           stack
    1           overflow
    2           welcome
    2           from
    2           stack
    2           overflow

You can then run queries against that table - e.g.

select count(*)
from  stringWords
where string_id = 2
and word in 
  (select word 
  from stringWords
  where string_id = 1);

gives you the size of the intersection, i.e. the number of words in string 2 that also appear in string 1.

You can then create a function or similar to calculate similarity according to your formula.
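
The formula itself isn't shown here, so as a stand-in here is a rough sketch using the Jaccard index (shared distinct words divided by all distinct words across both strings). The column types are my assumption; the table and column names follow the example above.

```sql
-- Assumed definition of the word table from the example above.
CREATE TABLE stringWords (
    string_id INT         NOT NULL,
    word      VARCHAR(64) NOT NULL
);

INSERT INTO stringWords (string_id, word) VALUES
    (1, 'hello'),   (1, 'from'), (1, 'stack'), (1, 'overflow'),
    (2, 'welcome'), (2, 'from'), (2, 'stack'), (2, 'overflow');

-- Jaccard index for strings 1 and 2:
-- distinct words present in both / distinct words present in either.
SELECT
    (SELECT COUNT(DISTINCT a.word)
       FROM stringWords a
       JOIN stringWords b
         ON b.word = a.word AND b.string_id = 2
      WHERE a.string_id = 1)
    /
    (SELECT COUNT(DISTINCT word)
       FROM stringWords
      WHERE string_id IN (1, 2)) AS jaccard;
```

For the sample rows that is 3 shared words (from, stack, overflow) out of 5 distinct ones, i.e. 0.6. Swap in whatever numerator and denominator your own formula calls for.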

Not very clean, but it should perform pretty snappily, it's mostly relational, and it should be largely language independent. To deal with possible typos, you could calculate the soundex - this would allow you to compare "stack" with "stak" and see how similar they really are, though this doesn't work reliably for languages other than English.
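
A small sketch of the soundex idea, assuming the same `stringWords` table as above. `SOUNDEX()` and the `SOUNDS LIKE` operator are built into MySQL; `a SOUNDS LIKE b` is shorthand for `SOUNDEX(a) = SOUNDEX(b)`.

```sql
-- Compare the phonetic codes of a word and a likely typo of it.
SELECT SOUNDEX('stack')           AS code_a,
       SOUNDEX('stak')            AS code_b,
       'stack' SOUNDS LIKE 'stak' AS same_sound;

-- Typo-tolerant intersection: count words of string 1 that
-- sound like some word of string 2.
SELECT COUNT(DISTINCT a.word) AS fuzzy_shared
FROM stringWords a
JOIN stringWords b
  ON SOUNDEX(b.word) = SOUNDEX(a.word)
 AND b.string_id = 2
WHERE a.string_id = 1;
```

Calling `SOUNDEX()` inside the join condition stops MySQL from using an index on `word`. If that becomes a problem, one option is to store the code in its own indexed column when you populate the table.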

SubniC

You can try the SOUNDEX algorithm, take a look here :)

SOUNDEX MySQL

EDIT 1:

Maybe this link about natural language processing with MySQL could be useful

Natural Language Full-Text Searches

How to find similar results and sort by similarity?

HTH!

This might be of help to you if you do not want to write your own algorithms:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
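
For reference, a minimal sketch of what a natural-language full-text query looks like. The `docs` table is hypothetical. The explicit `IN NATURAL LANGUAGE MODE` modifier needs MySQL 5.1 or later (natural language is the default mode if you leave it out), and a `FULLTEXT` index on an InnoDB table needs 5.6 or later.

```sql
-- Hypothetical table holding the strings to compare.
CREATE TABLE docs (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT KEY ft_body (body)
);

INSERT INTO docs (body) VALUES
    ('hello from stack overflow'),
    ('welcome from stack overflow'),
    ('an unrelated sentence');

-- Rows are scored by relevance to the search text; higher means closer.
SELECT id, body,
       MATCH(body) AGAINST('welcome stack overflow' IN NATURAL LANGUAGE MODE) AS score
FROM docs
WHERE MATCH(body) AGAINST('welcome stack overflow' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

The score is a relevance weight rather than a percentage, and stopwords, minimum word length settings and (on MyISAM) very common words all affect it, so on a tiny table like this the results can look odd.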
