levenshtein-distance

Levenshtein distance: how to better handle words swapping positions?

非 Y 不嫁゛ submitted on 2019-12-03 01:04:50
Question: I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example:

    levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

are treated as having less in common than:

    levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer an algorithm which saw that the first two were more similar. How could
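One common way to make the comparison insensitive to word order is to sort the words of each string before measuring the distance. The sketch below is in Python rather than PHP and is only an illustration, not the answer given in the original thread; the names `levenshtein` and `token_sort_distance` are hypothetical helpers written for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def token_sort_distance(s, t):
    """Hypothetical helper: sort the words of each string, then compare."""
    norm_s = " ".join(sorted(s.split()))
    norm_t = " ".join(sorted(t.split()))
    return levenshtein(norm_s, norm_t)


swapped = token_sort_distance("The quick brown fox", "brown quick The fox")
unrelated = token_sort_distance("The quick brown fox", "The quiet swine flu")
print(swapped, unrelated)
```

With this normalization the two reordered strings become identical, so they score as closer than the unrelated pair. The trade-off is that word order is discarded entirely, which may or may not suit the data.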

Compare similarity algorithms

▼魔方 西西 submitted on 2019-12-03 00:41:19
Question: I want to use string similarity functions to find corrupted data in my database. I came upon several of them: Jaro, Jaro-Winkler, Levenshtein, Euclidean and Q-gram. What is the difference between them, and in what situations does each work best?

Answer 1: Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we

Is there an edit distance algorithm that takes “chunk transposition” into account?

主宰稳场 submitted on 2019-12-03 00:17:49
I put "chunk transposition" in quotes because I don't know whether there is a technical term for it or what that term would be. Just knowing if there is a technical term for the process would be very helpful. The Wikipedia article on edit distance gives some good background on the concept. By taking "chunk transposition" into account, I mean that "Turing, Alan." should match "Alan Turing" more closely than it matches "Turing Machine". I.e., the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula. The strings

Levenshtein Distance: Inferring the edit operations from the matrix

那年仲夏 submitted on 2019-12-02 21:16:32
I wrote the Levenshtein algorithm in C++. If I input string s: democrat and string t: republican, I get the matrix D filled in, and the number of operations (the Levenshtein distance) can be read from D[10][8] = 8. Beyond the filled matrix, I want to construct the optimal solution. What should this solution look like? I have no idea. Please just show me what it should look like for this example. The question is: given the matrix produced by the Levenshtein algorithm, how can one find "the optimal solution"? I.e., how can we find the precise sequence of string operations: inserts, deletes and substitution [of a
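One standard way to recover an edit script is to walk back from the bottom-right cell of the matrix to the top-left, at each step choosing a neighbouring cell that explains the current value. The sketch below is in Python rather than C++ and is only one possible illustration; `edit_matrix` and `backtrace` are hypothetical names, and the matrix here is indexed as D[len(s)][len(t)], which is transposed relative to the D[10][8] indexing in the question.

```python
def edit_matrix(s, t):
    """D[i][j] is the edit distance between s[:i] and t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i
    for j in range(cols):
        D[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # delete s[i-1]
                          D[i][j - 1] + 1,         # insert t[j-1]
                          D[i - 1][j - 1] + cost)  # substitute or match
    return D


def backtrace(s, t, D):
    """Return one optimal script as (operation, char_from_s, char_from_t) tuples."""
    i, j = len(s), len(t)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and D[i][j] == D[i - 1][j - 1]:
            ops.append(("match", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + 1:
            ops.append(("substitute", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("delete", s[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, t[j - 1]))
            j -= 1
    ops.reverse()
    return ops


s, t = "democrat", "republican"
D = edit_matrix(s, t)
for op in backtrace(s, t, D):
    print(op)
```

Several optimal scripts usually exist. This walk prefers diagonal moves, so a different tie-breaking order would give a different but equally short sequence.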

How to speed up Levenshtein distance calculation

有些话、适合烂在心里 submitted on 2019-12-02 20:57:05
I am trying to run a simulation to test the average Levenshtein distance between random binary strings. My program is in Python, but I am using this C extension. The relevant function, which takes most of the time, computes the Levenshtein distance between two strings:

    lev_edit_distance(size_t len1, const lev_byte *string1,
                      size_t len2, const lev_byte *string2,
                      int xcost)
    {
      size_t i;
      size_t *row;  /* we only need to keep one row of costs */
      size_t *end;
      size_t half;

      /* strip common prefix */
      while (len1 > 0 && len2 > 0 && *string1 == *string2) {
        len1--;
        len2--;
        string1++;

Edit distance between two graphs

旧街凉风 submitted on 2019-12-02 20:35:37
I'm just wondering whether, just as strings have the Levenshtein distance (or edit distance), there is something similar for graphs. I mean a scalar measure that counts the atomic operations (node and edge insertion/deletion) needed to transform a graph G1 into a graph G2.

I think graph edit distance is the measure that you are looking for. Graph edit distance measures the minimum number of graph edit operations needed to transform one graph into another, and the allowed graph edit operations include:

    Insert/delete an isolated vertex
    Insert/delete an edge
    Change the

Hamming Distance vs. Levenshtein Distance

风格不统一 submitted on 2019-12-02 18:43:53
For the problem I'm working on, finding distances between two sequences to determine their similarity, sequence order is very important. However, the sequences that I have are not all the same length, so I pad any deficient strings with empty points such that both sequences are the same length in order to satisfy the Hamming distance requirement. Is there any major problem with me doing this, since all I care about is the number of transpositions (not insertions or deletions like Levenshtein does)? I've found that Hamming distance is much, much faster than Levenshtein as a distance metric for
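A small experiment can show where padding and Hamming distance diverge from Levenshtein. The sketch below is only an illustration: `padded_hamming` is a hypothetical helper, and the pad character "\0" is an assumption that must not occur in the real data.

```python
def padded_hamming(a, b, pad="\0"):
    """Right-pad the shorter string, then count positions that differ."""
    width = max(len(a), len(b))
    a, b = a.ljust(width, pad), b.ljust(width, pad)
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


a, b = "ABCDEFG", "XABCDEFG"  # b is a with one element inserted at the front
print(padded_hamming(a, b))   # every position is shifted, so all 8 differ
print(levenshtein(a, b))      # a single insertion explains the difference
```

A single inserted element shifts every later position, so the padded Hamming distance treats two nearly identical sequences as completely different. Whether that matters depends on whether such shifts can occur in the data.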

Improving search result using Levenshtein distance in Java

耗尽温柔 submitted on 2019-12-02 17:40:07
I have the following working Java code for searching for a word against a list of words, and it works as expected:

    public class Levenshtein {
        private int[][] wordMartix;

        public Set similarExists(String searchWord) {
            int maxDistance = searchWord.length();
            int curDistance;
            int sumCurMax;
            String checkWord;

            // preventing double words on returning list
            Set<String> fuzzyWordList = new HashSet<>();

            for (Object wordList : Searcher.wordList) {
                checkWord = String.valueOf(wordList);
                curDistance = calculateDistance(searchWord, checkWord);
                sumCurMax = maxDistance + curDistance;
                if (sumCurMax ==

How can I optimize this Python code to generate all words with word-distance 1?

穿精又带淫゛_ submitted on 2019-12-02 16:14:21
Profiling shows this is the slowest segment of my code for a little word game I wrote:

    def distance(word1, word2):
        difference = 0
        for i in range(len(word1)):
            if word1[i] != word2[i]:
                difference += 1
        return difference

    def getchildren(word, wordlist):
        return [ w for w in wordlist if distance(word, w) == 1 ]

Notes: distance() is called over 5 million times, the majority of which are from getchildren, which is supposed to get all words in the wordlist that differ from word by exactly 1 letter. wordlist is pre-filtered to only have words containing the same number of letters as word so it's guaranteed
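One way to avoid calling distance() for every pair is to index the word list once by "wildcard" patterns, where each pattern is a word with one letter replaced by a placeholder. Two equal-length words differ by exactly one letter precisely when they are different words that share a pattern. The sketch below is an assumption-laden illustration, not the answer from the original thread: `build_index` is a hypothetical helper, this `getchildren` takes the index instead of the word list, and "*" is assumed never to appear in a word.

```python
from collections import defaultdict


def build_index(wordlist):
    """Map each wildcard pattern (one letter replaced by '*') to the words matching it."""
    index = defaultdict(set)
    for w in wordlist:
        for i in range(len(w)):
            index[w[:i] + "*" + w[i + 1:]].add(w)
    return index


def getchildren(word, index):
    """Return the set of indexed words that differ from word in exactly one position."""
    children = set()
    for i in range(len(word)):
        pattern = word[:i] + "*" + word[i + 1:]
        # .get avoids creating empty entries in the defaultdict on lookup
        children.update(index.get(pattern, ()))
    children.discard(word)  # the word matches its own patterns
    return children


words = ["cat", "cot", "cog", "dog", "bat"]
index = build_index(words)
print(sorted(getchildren("cat", index)))
```

Building the index costs one pass over the word list, after which each lookup touches only len(word) buckets. Note that this version returns a set, so duplicate entries in the word list collapse, unlike the list comprehension above.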

Compare 5000 strings with PHP Levenshtein

一世执手 submitted on 2019-12-02 15:19:41
I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

I think a better way to group similar addresses would be to: create a database with two tables - one for the address (and an id), one for the soundexes of words or literal
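Independently of the database approach described above, a cheap way to skip some comparisons is to use the fact that the Levenshtein distance between two strings is never smaller than the difference of their lengths. The sketch below is in Python rather than PHP and illustrates only that pruning idea; `similar_pairs` and the `max_dist` threshold are hypothetical names chosen for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar_pairs(strings, max_dist):
    """Return pairs within max_dist, comparing only strings of nearby length."""
    ordered = sorted(strings, key=len)
    pairs = []
    for pos, a in enumerate(ordered):
        for b in ordered[pos + 1:]:
            # Lengths are sorted, so once the gap exceeds max_dist
            # no later string can be within max_dist of a.
            if len(b) - len(a) > max_dist:
                break
            if levenshtein(a, b) <= max_dist:
                pairs.append((a, b))
    return pairs


addresses = ["12 Main St", "12 Main St.", "12 Main Street", "99 Oak Avenue"]
print(similar_pairs(addresses, max_dist=1))
```

This only helps when lengths vary. If most addresses have similar lengths, the number of comparisons stays close to quadratic, and a grouping key such as the soundex table mentioned above is likely to prune more.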