levenshtein-distance | 易学教程

Is there a faster (less precise) algorithm than Levenshtein for string distance?

Question: I want to run Levenshtein, but much faster, because I'm building a real-time application. It can terminate once the distance is greater than 10.

Answer 1: The Levenshtein distance metric allows insertion, deletion, or substitution operations. If you're looking for a faster but less precise metric, you can use the longest common subsequence (allows only insertion and deletion) or even the Hamming distance (allows only substitution). However, I recommend that you try to optimize your
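The "terminate once the distance is greater than 10" idea from the question can be sketched in Python as follows. This is only an illustration, not the answerer's code: the function name `bounded_levenshtein` and the `limit` parameter are made up for this sketch. It relies on two facts: the length difference is a lower bound on the distance, and the minimum of each row of the dynamic-programming table never decreases from one row to the next.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, limit: int = 10) -> Optional[int]:
    """Return the Levenshtein distance, or None as soon as it must exceed `limit`.

    `bounded_levenshtein` and `limit` are hypothetical names for this sketch.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > limit:
        return None

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        # Row minima never decrease, so the final distance is at least
        # min(current); once that exceeds the limit we can stop early.
        if min(current) > limit:
            return None
        previous = current

    return previous[-1] if previous[-1] <= limit else None
```

A caller would treat `None` as "too far apart" and skip the candidate, which avoids filling the rest of the table for clearly dissimilar strings.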

Similarity Score - Levenshtein

I implemented the Levenshtein algorithm in Java and now get the number of corrections made by the algorithm, a.k.a. the cost. This helps a little, but not much, since I want the result as a percentage. So I want to know how to calculate such a similarity score. I would also like to know how other people do it and why.

Ralph: The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. (Wikipedia) So a Levenshtein distance of
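One common way to turn the cost into a percentage is to divide by the length of the longer string, since the distance can never exceed that length. The sketch below is in Python rather than the asker's Java, and the normalisation is an assumption, not necessarily what the truncated answer goes on to recommend; `similarity_percent` is a hypothetical name.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity in [0, 100], normalised by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

With this choice, identical strings score 100 and strings with nothing in common score 0; other normalisations (for example by the sum of the lengths) give different but equally defensible percentages.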

How to sort an array by similarity in relation to an inputted word.

I have a PHP array, for example: $arr = array("hello", "try", "hel", "hey hello"); Now I want to reorder the array so that the words closest to my $search variable come first. How can I do that?

A quick solution could be to use similar_text (http://php.net/manual/en/function.similar-text.php): This calculates the similarity between two strings as described in Programming Classics: Implementing the World's Best Algorithms by Oliver (ISBN 0-131-00413-1). Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or
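The same "score each word, then sort" idea can be sketched in Python. Note the swap: this uses `difflib.SequenceMatcher` from the standard library instead of PHP's `similar_text`, so the scores are similar in spirit but not guaranteed to be identical. `sort_by_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def sort_by_similarity(words, search):
    """Return `words` ordered from most to least similar to `search`."""
    def score(word):
        # ratio() is 2 * matches / total length, a value in [0, 1].
        return SequenceMatcher(None, search.lower(), word.lower()).ratio()

    return sorted(words, key=score, reverse=True)
```

Calling `sort_by_similarity(["hello", "try", "hel", "hey hello"], "hello")` puts the exact match first and the unrelated word last. Swapping the `score` function for a Levenshtein-based one would change the ranking rule without touching the sorting logic.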

How to compare almost similar Strings in Java? (String distance measure) [closed]

I would like to compare two strings and get a score for how alike they are. For example, "The sentence is almost similar" and "The sentence is similar". I'm not familiar with existing methods in Java, but for PHP I know the levenshtein function. Are there better methods in Java?

Joey: The Levenshtein distance is a measure of how similar strings are. Or, more precisely, of how many alterations have to be made so that they are the same. The algorithm is available in pseudo-code on Wikipedia. Converting that to Java shouldn't be much of a problem, but it's not built into the base class

Damerau-Levenshtein php

Question: I'm searching for an implementation of the Damerau–Levenshtein algorithm for PHP, but it seems that I can't find anything with Google. So far I have to either use PHP's own Levenshtein implementation (without the Damerau transposition, which is very important), or get the original source code (in C, C++, C#, Perl) and translate it to PHP. Does anybody know of a PHP implementation? I'm using soundex and double metaphone for a "Did you mean:" extension on my corporate intranet, and
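For reference while translating, here is a minimal sketch of the transposition step, written in Python rather than PHP. It implements the restricted variant usually called optimal string alignment (OSA), which adds adjacent transpositions to plain Levenshtein but never edits the same substring twice; the unrestricted Damerau–Levenshtein distance can be smaller in some cases. `osa_distance` is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" -> "ba" counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The only difference from the plain Levenshtein table is the final `if`, so an existing Levenshtein port can usually be extended rather than rewritten.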

How to configure Solr to use Levenshtein approximate string matching?

Question: Does the Apache Solr search engine provide approximate string matching, e.g. via the Levenshtein algorithm? I'm looking for a way to find customers by last name, but I cannot guarantee that the names are spelled correctly. How can I configure Solr so that it finds the person "Levenshtein" even if I search for "Levenstein"?

Answer 1: Typically this is done with the SpellCheckComponent, which internally uses the Lucene SpellChecker by default, which implements Levenshtein. The wiki really explains very well

Percentage rank of matches using Levenshtein Distance matching

Question: I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as the number of operations required to convert the search string into the matched string. I want to present the results as a ranked percentage list of the top N (say 10) matches. Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate logic for expressing the distance as a
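A possible shape for such a ranked list, sketched in Python: score every dictionary entry, normalise by the longer of the two strings so that differing lengths are handled uniformly, and keep the N best. The normalisation and the names `top_matches` and `n` are assumptions made for this sketch, not something the question settles.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def top_matches(term, dictionary, n=10):
    """Return up to `n` (percent, word) pairs, best match first."""
    scored = []
    for candidate in dictionary:
        # Dividing by the longer string keeps the score within 0..100.
        longest = max(len(term), len(candidate)) or 1
        percent = 100.0 * (1 - levenshtein(term, candidate) / longest)
        scored.append((percent, candidate))
    return heapq.nlargest(n, scored)
```

For example, `top_matches("Levenstein", names, n=10)` would rank "Levenshtein" (one edit over eleven characters) above any name that needs two or more edits.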

How python-Levenshtein.ratio is computed

Question: According to the python-Levenshtein ratio source (https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722), it is computed as (lensum - ldist) / lensum. This works for distance('ab', 'a') = 1, ratio('ab', 'a') = 0.666666. However, it seems to break with distance('ab', 'ac') = 1, ratio('ab', 'ac') = 0.5. I feel I must be missing something very simple, but why not 0.75?

Answer 1: Levenshtein distance for 'ab' and 'ac' as below: so alignment is: a c a b Alignment length = 2 number
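Both numbers in the question are reproduced if ldist in the formula counts a substitution as two edits (one deletion plus one insertion) instead of one. The Python sketch below is a reimplementation under that assumption, not the library's C code; `weighted_distance` and `substitution_cost` are hypothetical names.

```python
def weighted_distance(a: str, b: str, substitution_cost: int = 2) -> int:
    """Edit distance where a substitution costs `substitution_cost`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """(lensum - ldist) / lensum, with substitutions counted twice in ldist."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (lensum - weighted_distance(a, b)) / lensum
```

For 'ab' and 'ac', lensum is 4 and the weighted distance is 2, giving (4 - 2) / 4 = 0.5, whereas the question's expected 0.75 would come from plugging in the ordinary distance of 1. For 'ab' and 'a' no substitution is involved, so both readings give 2/3.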

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on "How does clustering (especially String clustering) work?" informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go and not k-means. Although I understand the problem in its abstract form, I do not know what the easiest way to actually do it is. For example, is MATLAB or R a better choice for the actual implementation of hierarchical
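As a rough illustration in Python (neither MATLAB nor R, so treat it as a sketch of the idea rather than an answer to the tool question): single-linkage hierarchical clustering cut at a distance threshold is equivalent to joining every pair of strings whose distance is at most that threshold and taking the connected groups. The names `cluster` and `max_distance` are hypothetical, and the threshold value is an assumption that would need tuning.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cluster(strings, max_distance=1):
    """Single-linkage clusters: strings joined by chains of close pairs."""
    parent = list(range(len(strings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that is close enough.
    for i, j in combinations(range(len(strings)), 2):
        if levenshtein(strings[i], strings[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

For a few thousand strings this means a few million pairwise distance computations, which is workable for strings of 3 to 6 characters but slow in pure Python. Other linkage criteria (complete, average) need a full hierarchical clustering routine rather than this shortcut.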

Fuzzy matching of product names

Question: I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but not well enough
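Because the example names differ mostly in word order, spacing, and filler words, one alternative worth sketching is to compare sets of normalised tokens instead of raw character edits. The Python below uses Jaccard similarity on tokens, which is a different technique from the Levenshtein heuristics the asker describes. The stop-word list, the abbreviation table, and all function names are hypothetical and would have to be built from real data.

```python
import re

STOPWORDS = {"new", "from", "digital", "camera"}  # hypothetical noise words
ABBREVIATIONS = {"ps": "powershot"}               # hypothetical alias table


def tokens(name: str) -> frozenset:
    """Lowercase, split letters from digits, expand aliases, drop noise."""
    raw = re.findall(r"[a-z]+|\d+", name.lower())
    expanded = (ABBREVIATIONS.get(t, t) for t in raw)
    return frozenset(t for t in expanded if t not in STOPWORDS)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(name: str, canonical_names):
    """Return the canonical name whose token set is most similar."""
    name_tokens = tokens(name)
    return max(canonical_names, key=lambda c: jaccard(name_tokens, tokens(c)))
```

Splitting letters from digits makes "a20IS" and "A20 IS" produce the same tokens, and a model number change such as A20 versus A30 removes a whole token, which penalises it more than a single character edit would.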