levenshtein-distance

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

ぐ巨炮叔叔 submitted on 2019-12-07 15:50:23
Question: So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby which works great for really small strings. But my strings can be upwards of 10,000 characters long, and since that Levenshtein implementation is recursive, it causes a "stack too deep" error in my Ruby on Rails app. So, is there another, less stack-intensive method of finding the similarity between two large strings? Alternatively, I'd need a way to make the stack have much
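The usual fix is to replace the recursion with the bottom-up dynamic-programming table, which needs no call stack at all. A minimal sketch of that idea, written in Python rather than Ruby and keeping only two rows of the table so memory stays at O(min(n, m)); the same loop translates directly to Ruby:

```python
def levenshtein(a: str, b: str) -> int:
    """Iterative two-row Levenshtein distance; no recursion, so no deep stack."""
    if len(a) < len(b):
        a, b = b, a                          # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ca != cb),      # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```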

string comparison in python but not Levenshtein distance (I think)

家住魔仙堡 submitted on 2019-12-07 14:10:30
Question: I found a crude string comparison in a paper I am reading, done as follows. The equation they use is as follows (extracted from the paper, with small word changes to make it more general and readable). I have tried to explain a bit more in my own words, since the description by the author is not very clear (using an example by the author). For example, for the two sequences ABCDE and BCEFA there are two possible graphs: graph 1) connects B with B, C with C and E with E; graph 2) connects A with A I

Longest Common Substring with wrong character tolerance

纵然是瞬间 submitted on 2019-12-07 11:38:55
Question: I have a script I found on here that works well when looking for the Longest Common Substring. However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable. For example, I want to find this string: "big yellow school bus" inside of this string: "they rode the bigyellow schook bus that afternoon". This is the code I'm currently using: function longest
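One way to get that tolerance is to switch from longest-common-substring to approximate substring matching: run the edit-distance DP, but let the pattern start anywhere in the text for free (Sellers' algorithm). A minimal Python sketch of that idea, not the original function from the thread:

```python
def best_approx_match(needle: str, haystack: str) -> tuple[int, int]:
    """Approximate substring matching: the edit-distance DP, except the needle
    may start anywhere in the haystack (first row is all zeros).
    Returns (best_distance, end_index_in_haystack) of the closest occurrence."""
    prev = [0] * (len(haystack) + 1)          # a match may begin at any position
    for i, nc in enumerate(needle, start=1):
        curr = [i]                            # the whole needle must be consumed
        for j, hc in enumerate(haystack, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (nc != hc)))
        prev = curr
    best_j = min(range(len(prev)), key=prev.__getitem__)
    return prev[best_j], best_j

needle = "big yellow school bus"
haystack = "they rode the bigyellow schook bus that afternoon"
dist, end = best_approx_match(needle, haystack)
print(dist, end, 1 - dist / len(needle))      # small distance, similarity near 1
```

Dividing the best distance by the needle length gives the percentage-style similarity threshold the question asks for.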

Levenshtein algorithm - fail-fast if edit distance is bigger than a given threshold

。_饼干妹妹 submitted on 2019-12-07 10:38:00
Question: For the Levenshtein algorithm I have found this implementation for Delphi. I need a version that stops as soon as a maximum distance is hit and returns the distance found so far. My first idea is to check the current result after every iteration:

for i := 1 to n do
  for j := 1 to m do
  begin
    d[i, j] := Min(Min(d[i-1, j]+1, d[i, j-1]+1), d[i-1, j-1] + Integer(s[i] <> t[j]));
    // check
    Result := d[n, m];
    if Result > max then
    begin
      Exit;
    end;
  end;

Answer 1: I gather what you want is to find the Levenshtein
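The catch with checking d[n, m] inside the loop is that that cell has not been filled in yet. The usual early exit instead tests the minimum of the row just completed, which is a valid lower bound on the final distance. A small sketch of that idea in Python (the thread's code is Delphi; this translation is mine):

```python
def levenshtein_capped(s: str, t: str, max_dist: int) -> int:
    """Levenshtein with a cap: if every cell in the current row already exceeds
    max_dist, the final distance must too (the row minimum is a lower bound),
    so we return max_dist + 1 as a sentinel."""
    # Cheap pre-check: the length difference is itself a lower bound.
    if abs(len(s) - len(t)) > max_dist:
        return max_dist + 1
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (sc != tc)))
        if min(curr) > max_dist:          # no later cell can drop below the cap
            return max_dist + 1
        prev = curr
    return prev[-1]

print(levenshtein_capped("distance", "instance", 2))     # 2
print(levenshtein_capped("distance", "levenshtein", 2))  # 3 (capped sentinel)
```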

Sqlite with real “Full Text Search” and spelling mistakes (FTS+spellfix together)

徘徊边缘 submitted on 2019-12-07 05:15:52
Question: Let's say we have 1 million rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

Background: I know how to do this with Sqlite: find a row with a single-word query, up to a few spelling mistakes, with the spellfix module and Levenshtein distance (I have posted a detailed
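For the full-text half of the problem, SQLite's FTS5 module handles multi-word queries on its own; the spellfix1 part would then be used to correct each query token against a vocabulary before running the MATCH. A minimal sketch of just the FTS5 side, assuming FTS5 is compiled into your sqlite3 build, with the spellfix step only hinted at in a comment:

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()

# FTS5 virtual table holding the searchable text; rowid doubles as the record id.
c.execute("CREATE VIRTUAL TABLE mytable_fts USING fts5(description)")
c.executemany("INSERT INTO mytable_fts(rowid, description) VALUES (?, ?)",
              [(1, "Riemann"), (2, "All the Carmichael numbers")])

def search(query):
    # In the combined setup, each token of `query` would first be corrected
    # against a spellfix1 vocabulary table (loaded as a separate extension),
    # e.g. "carmichel" -> "carmichael", before being handed to MATCH.
    return c.execute(
        "SELECT rowid, description FROM mytable_fts WHERE mytable_fts MATCH ?",
        (query,)).fetchall()

print(search("carmichael numbers"))   # the default tokenizer is case-folding
```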

Optimize speed of Levenshtein distance of many words

谁都会走 submitted on 2019-12-06 15:31:05
I have a cell array dictionary which contains a lot of words (ca. 15000). I want to compute the function strdist (which calculates the Levenshtein distance) for all pairs of words. I have tried two ways, but both are really slow. What would be a more efficient solution? Here is my code (dict_keys is my cell array of length m):

1)
matrix = sparse(m,m);
for i = 1:m-1
    matrix(i,:) = cellfun( @(u) strdist(dict_keys{i}, u), dict_keys );
end

2)
matrix = sparse(m,m);
for i = 1:m-1
    for j = i+1:m
        matrix(i,j) = strdist(dict_keys{i}, dict_keys{j});
    end
end

Function 'strdist' is not an inbuilt MATLAB
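With ~15,000 words there are roughly 10^8 pairs, so the biggest win usually comes from not computing most of them at all: the length difference is already a lower bound on the Levenshtein distance, so any pair that cannot fall under whatever threshold you care about can be skipped outright. A sketch of that pruning idea in Python (the structure carries over to MATLAB; element-by-element assignment into a growing sparse matrix is itself slow, which is avoided here by collecting results in a dictionary):

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def pairwise_within(words, max_dist):
    """Distances for the upper triangle only, skipping pairs whose length
    difference already rules them out (|len(a) - len(b)| <= lev(a, b))."""
    words = sorted(words, key=len)        # near-length words end up adjacent
    out = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if len(b) - len(a) > max_dist:    # every later word is even longer
                break
            d = levenshtein(a, b)
            if d <= max_dist:
                out[(a, b)] = d
    return out

print(pairwise_within(["book", "back", "bock", "bookkeeper"], 2))
```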

Complexity of edit distance (Levenshtein distance) recursion top down implementation

偶尔善良 submitted on 2019-12-06 10:10:33
Question: I have been working all day on a problem which I can't seem to get a handle on. The task is to show that a recursive implementation of edit distance has time complexity Ω(2^max(n,m)), where n and m are the lengths of the words being compared. The implementation is comparable to this small Python example:

def lev(a, b):
    if "" == a: return len(b)   # returns if a is an empty string
    if "" == b: return len(a)   # returns if b is an empty string
    return min(lev(a[:-1], b[:-1]) + (a[-1] != b[-1]), lev
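For reference, the completed recurrence looks like the sketch below; the last two recursive calls are my completion of the truncated snippet, following the standard definition. The intuition behind the lower bound is that every level of recursion in which both strings are still non-empty spawns several further calls on arguments that are barely shorter, so the call tree grows exponentially; memoizing the (len(a)+1) * (len(b)+1) distinct subproblems collapses it to O(n*m):

```python
from functools import lru_cache

def lev(a: str, b: str) -> int:
    """Plain top-down recursion, as in the question (completed)."""
    if a == "":
        return len(b)
    if b == "":
        return len(a)
    return min(lev(a[:-1], b[:-1]) + (a[-1] != b[-1]),  # substitute / match
               lev(a[:-1], b) + 1,                      # drop last char of a
               lev(a, b[:-1]) + 1)                      # drop last char of b

@lru_cache(maxsize=None)
def lev_memo(a: str, b: str) -> int:
    """Same recurrence, but each (a, b) pair is computed only once."""
    if a == "" or b == "":
        return len(a) + len(b)
    return min(lev_memo(a[:-1], b[:-1]) + (a[-1] != b[-1]),
               lev_memo(a[:-1], b) + 1,
               lev_memo(a, b[:-1]) + 1)

print(lev("kitten", "sitting"), lev_memo("kitten", "sitting"))  # 3 3
```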

Get the most repeated similar fields in MySQL database

微笑、不失礼 submitted on 2019-12-06 05:43:15
Let's assume we have a database like:

Actions_tbl:
------------------------------------------------------------
 id | Action_name                               | user_id
------------------------------------------------------------
 1  | John reads one book                       | 1
 2  | reading the book by john                  | 1
 3  | Joe is jumping over fire                  | 2
 4  | reading another book                      | 2
 5  | John reads the book in library            | 1
 6  | Joe read a book                           | 2
 7  | read a book                               | 3
 8  | jumping with no reason is Ronald's habit  | 3
------------------------------------------------------------

Users_tbl:
-----------------------
 user_id | user_name
-----------------------
 1       | John
 2       | Joe
 3       | Ronald
 4       | Araz
-----------------------

Wondering
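Judging from the title, the goal is to group rows whose Action_name describes the same activity and count the groups. One hedged way to do that outside of SQL is to pull the rows and cluster them with a string-similarity ratio; the sketch below uses Python's difflib rather than raw Levenshtein, hard-codes the sample rows in place of a real MySQL fetch, and the 0.5 threshold is an arbitrary illustration value:

```python
from difflib import SequenceMatcher

# Sample rows as in the question; in practice these would be fetched from Actions_tbl.
actions = [
    (1, "John reads one book", 1),
    (2, "reading the book by john", 1),
    (3, "Joe is jumping over fire", 2),
    (4, "reading another book", 2),
    (5, "John reads the book in library", 1),
    (6, "Joe read a book", 2),
    (7, "read a book", 3),
    (8, "jumping with no reason is Ronald's habit", 3),
]

def similar(a, b, threshold=0.5):
    """Case-insensitive similarity ratio between two action names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy clustering: each action joins the first existing group it resembles.
groups = []
for row in actions:
    for group in groups:
        if similar(row[1], group[0][1]):
            group.append(row)
            break
    else:
        groups.append([row])

# Most repeated similar actions first.
for group in sorted(groups, key=len, reverse=True):
    print(len(group), [g[1] for g in group])
```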

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

我只是一个虾纸丫 submitted on 2019-12-06 05:28:58
Question: We recently encountered an interesting problem at work where we discovered duplicate user-submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference in length between the two strings in question. That indicates that if we simply add characters from one string into the other, we end up with the same string, and for most items this seems like the best way for us to account for duplicates. We also want to account for typos
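That observation, a distance equal to the length gap plus a little slack for typos, translates almost directly into a rule. A minimal sketch, where the typo_budget of 2 is an arbitrary illustration value to be tuned against real data:

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def looks_duplicate(a: str, b: str, typo_budget: int = 2) -> bool:
    """Flag two records as duplicates if the edit distance is fully explained
    by the length difference (one string is the other plus extra characters)
    or stays within a small typo budget on top of that."""
    return levenshtein(a, b) <= abs(len(a) - len(b)) + typo_budget

print(looks_duplicate("colour", "color"))                 # True: distance equals the length gap
print(looks_duplicate("acommodation", "accommodation"))   # True: one missing character
print(looks_duplicate("apple", "orange"))                 # False: genuinely different strings
```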

Edit Distance Algorithm

元气小坏坏 submitted on 2019-12-06 02:56:35
Question: I have a dictionary of 'n' given words and there are 'm' queries to respond to. For each query I want to output the number of dictionary words that are within edit distance 1 or 2 of it. I want to optimize this given that n and m are both roughly 3000. Edit added from answer below: I will try to word it differently. Initially there are 'n' words given as a set of dictionary words. Next, 'm' words are given which are query words, and for each query word I need to find if the word already exists in Dictionary
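With n and m both around 3000, a brute-force scan is only about 9 million comparisons, which is manageable as long as each comparison bails out as soon as the distance can no longer stay at or below 2. A small sketch of that cutoff check in Python (more elaborate structures such as a BK-tree or a deletion-neighbourhood index would also work, but may be overkill at this size):

```python
def within_two(a: str, b: str) -> bool:
    """Levenshtein with a cutoff of 2: abandon the computation as soon as every
    cell in the current row exceeds 2, since the row minimum is a lower bound."""
    if abs(len(a) - len(b)) > 2:        # length gap alone already rules it out
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(curr) > 2:
            return False
        prev = curr
    return prev[-1] <= 2

dictionary = ["apple", "apply", "ample", "maple", "orange"]
queries = ["aple", "orang", "grape"]
for q in queries:
    matches = [w for w in dictionary if within_two(q, w)]
    print(q, "->", len(matches), matches)
```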