levenshtein-distance

String comparison in Python, but not Levenshtein distance (I think)

做~自己de王妃 posted on 2019-12-06 02:26:21
I found a crude string comparison in a paper I am reading. The equation they use is as follows (extracted from the paper, with small wording changes to make it more general and readable). I have tried to explain it a bit more in my own words, since the author's description is not very clear (using an example by the author). For example, for the two sequences ABCDE and BCEFA, there are two possible graphs: graph 1) connects B with B, C with C, and E with E; graph 2) connects A with A. I cannot connect A with A while I am connecting the other three (graph 1), since that would be crossing

String Distance Matrix in Python

家住魔仙堡 posted on 2019-12-06 01:25:20
Question: How do I calculate a Levenshtein distance matrix of strings in Python?

        str1  str2  str3  str4  ...  strn
str1    0.8   0.4   0.6   0.1   ...  0.2
str2    0.4   0.7   0.5   0.1   ...  0.1
str3    0.6   0.5   0.6   0.1   ...  0.1
str4    0.1   0.1   0.1   0.5   ...  0.6
.       .     .     .     .     ...  .
.       .     .     .     .     ...  .
.       .     .     .     .     ...  .
strn    0.2   0.1   0.1   0.6   ...  0.7

Using the distance function we can calculate the distance between 2 words. But here I have 1 list containing n strings. I want to calculate the distance matrix, and after that I want to cluster the words.

Answer 1:
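As a rough illustration of the question above, a pairwise matrix can be built by calling an edit-distance function on every pair in the list. This is a minimal sketch and not the answer from the original thread. The names `levenshtein` and `distance_matrix` and the sample words are assumptions made for the example. It returns raw integer distances with a zero diagonal, not the normalized values shown in the question's table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row (hypothetical helper for this sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(words):
    """Return an n x n list of lists holding all pairwise edit distances."""
    n = len(words)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(words[i], words[j])
            # Fill both halves at once, since the distance is symmetric.
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Sample data chosen only for illustration.
words = ["kitten", "sitting", "mitten", "fitting"]
matrix = distance_matrix(words)
for word, row in zip(words, matrix):
    print(f"{word:<8}", row)
```

The resulting matrix could then be handed to any clustering routine that accepts precomputed distances.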

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

荒凉一梦 posted on 2019-12-05 23:51:55
So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby which works great for really small strings. But my strings can be upwards of 10,000 characters long -- and since that Levenshtein distance implementation is recursive, this causes a "stack too deep" error in my Ruby on Rails app. So, is there another, maybe less stack-intensive, method of finding the similarity between two large strings? Alternatively, I'd need a way to give the stack a much larger size. (I don't think this is the right way to solve the problem, though.)

Consider a non-recursive
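To illustrate what a non-recursive version might look like, here is a minimal sketch in Python rather than Ruby, so it is not the code from the linked page. It fills the dynamic-programming table one row at a time with plain loops. The call stack therefore stays flat, and memory is proportional to the shorter string. The `similarity` helper and its normalization by the longer length are assumptions added for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Iterative edit distance that keeps only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # swapping is safe and keeps the rows as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (ca != cb))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

The running time is still proportional to the product of the two lengths, so two 10,000-character strings cost on the order of 100 million cell updates. Only the recursion depth problem goes away.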

Algorithm to find edit distance to all substrings

放肆的年华 posted on 2019-12-05 17:10:26
Question: Given 2 strings s and t, I need to find, for each substring of s, the edit distance (Levenshtein distance) to t. Actually, I need to know, for each position i in s, the minimum edit distance over all substrings starting at position i. For example:

t = "ab"
s = "sdabcb"

And I need to get something like: {2,1,0,2,2}

Explanation:

1st position:
distance("ab", "sd") = 4 ( 2*subst )
distance("ab", "sda") = 3 ( 2*delete + insert )
distance("ab", "sdab") = 2 ( 2*delete )
distance("ab", "sdabc") = 3 (

Levenshtein algorithm - fail-fast if edit distance is bigger than a given threshold

为君一笑 posted on 2019-12-05 12:17:58
For the Levenshtein algorithm I have found this implementation for Delphi. I need a version which stops as soon as a maximum distance is hit, and returns the distance found so far. My first idea is to check the current result after every iteration:

for i := 1 to n do
  for j := 1 to m do
  begin
    d[i, j] := Min(Min(d[i-1, j]+1, d[i,j-1]+1), d[i-1,j-1]+Integer(s[i] <> t[j]));
    // check
    Result := d[n, m];
    if Result > max then
    begin
      Exit;
    end;
  end;

I gather what you want is to find the Levenshtein distance, if it is below MAX, right? If so, reaching a value larger than MAX is not enough, since it only
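One common way to make an early exit safe is to test the minimum of a whole row instead of a single cell. The final distance can never be smaller than the minimum of any completed row, so once every cell in a row exceeds the threshold the result must exceed it too. The sketch below is in Python rather than Delphi and is not taken from the original answer. The name `bounded_levenshtein` and the choice to return `None` (instead of the "distance found so far") are assumptions for illustration.

```python
def bounded_levenshtein(s: str, t: str, max_dist: int):
    """Return the edit distance if it is <= max_dist, otherwise None."""
    # The length difference alone is a lower bound on the distance.
    if abs(len(s) - len(t)) > max_dist:
        return None
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        # Every later cell is built from this row, so if the whole row is
        # above the threshold the final value must be above it as well.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None
```

For example, `bounded_levenshtein("kitten", "sitting", 3)` gives 3, while the same call with a threshold of 2 gives `None`.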

Longest Common Substring with wrong character tolerance

折月煮酒 posted on 2019-12-05 12:01:08
I have a script I found on here that works well when looking for the Longest Common Substring. However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable. For example, I want to find this string:

big yellow school bus

inside of this string:

they rode the bigyellow schook bus that afternoon

This is the code I'm currently using:

function longest_common_substring($words) {
    $words = array_map('strtolower', array_map('trim', $words));
    $sort_by_strlen =
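One way to tolerate wrong or missing characters is approximate substring matching. It is the usual edit-distance table with the first row set to zero, so a match may begin anywhere in the longer text. The sketch below is in Python rather than PHP and swaps in this technique in place of the longest-common-substring script from the question. The function name `best_fuzzy_match` and the `allowed` threshold are assumptions for illustration.

```python
def best_fuzzy_match(pattern: str, text: str):
    """Return (distance, end) for the substring of text closest to pattern.

    `end` is the exclusive index in `text` where the best match stops.
    """
    # Row 0 is all zeros: starting the match at any text position is free.
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        curr = [i]  # i pattern characters matched against nothing cost i
        for j, ct in enumerate(text, start=1):
            curr.append(min(prev[j] + 1,               # pattern char missing in text
                            curr[j - 1] + 1,           # extra char in text
                            prev[j - 1] + (cp != ct)))  # substitution or match
        prev = curr
    # The smallest value in the last row is the best distance to any substring.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


pattern = "big yellow school bus"
text = "they rode the bigyellow schook bus that afternoon"
distance, end = best_fuzzy_match(pattern, text)

allowed = 3  # hypothetical tolerance: maximum number of wrong/missing characters
print(distance, distance <= allowed)
```

A percentage-style tolerance could be derived by comparing `distance` with `len(pattern)`.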

Sqlite with real “Full Text Search” and spelling mistakes (FTS+spellfix together)

北城以北 posted on 2019-12-05 09:14:25
Let's say we have 1 million rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

Background: I know how to do this with Sqlite:

Find a row with a single-word query, up to a few spelling mistakes, with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):

db.enable_load_extension(True)
db.load

Shortest Levenshtein Distance? Do I need it?

强颜欢笑 posted on 2019-12-05 06:19:13
I want to look up a String in a String[] to find the best match for the query. I have heard of Levenshtein distance, but I cannot determine whether I need it or not. Suppose I have

String query = "Examples"

and

String[] arrayStr = new String[] {"The Examples String", "The Example String", "Example", "Examples String", "Example String", "Examplestring"};

Now, I want to get Example from the String[] as the best match. So, do I need Levenshtein distance to do it? Alternatively, if someone can point me to a fast implementation of Levenshtein distance for Java, it would be great. I would like to check if

Levenshtein distance symmetric?

落爺英雄遲暮 posted on 2019-12-05 02:46:58
I was informed that Levenshtein distance is symmetric. When I used Google's diffMatchPatch tool, which computes Levenshtein distance among other things, the results don't imply that Levenshtein distance is symmetric, i.e. Levenshtein(x1,x2) is not equal to Levenshtein(x2,x1). Is Levenshtein not symmetric, or is there a problem with that particular implementation? Thanks.

Just looking at the basic algorithm, it definitely is symmetric given the same cost for the operations: the number of additions, deletions and substitutions to get from a word A to a word B is the same as getting from word B to word A. If
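The "same cost for the operations" condition can be checked with a small experiment. The sketch below is a generic weighted edit distance written for this illustration. It is not the diffMatchPatch implementation, and the name `weighted_distance` and its parameters are assumptions. With unit costs both directions agree. Once insertion and deletion are priced differently, the two directions can differ.

```python
def weighted_distance(a: str, b: str, ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Cost of turning a into b with configurable operation costs."""
    # Turning "" into b[:j] takes j insertions.
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + dele,                        # delete ca
                            curr[j - 1] + ins,                     # insert cb
                            prev[j - 1] + (0 if ca == cb else sub)))  # substitute or match
        prev = curr
    return prev[-1]


# Equal costs: both directions give the same value.
print(weighted_distance("abc", "ab"), weighted_distance("ab", "abc"))

# Deletion three times as expensive as insertion: the directions now differ.
print(weighted_distance("abc", "ab", dele=3), weighted_distance("ab", "abc", dele=3))
```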

How to convert Python/Cython unicode string to array of long integers, to do Levenshtein edit distance [duplicate]

空扰寡人 posted on 2019-12-05 02:29:34
Question: This question already has an answer here. Closed 7 years ago.

Possible Duplicate: How to correct bugs in this Damerau-Levenshtein implementation?

I have the following Cython code (adapted from the bpbio project) that does Damerau-Levenshtein edit-distance calculation:

#---------------------------------------------------------------------------
cdef extern from "stdlib.h":
    ctypedef unsigned int size_t
    size_t strlen(char *s)
    void *malloc(size_t size)
    void *calloc(size_t n, size_t size)
    void