levenshtein-distance | 易学教程

Reverse Levenshtein distance

Question: With Levenshtein distance you usually ask: given these two strings, what is their Levenshtein distance? How would you go about taking a string and a Levenshtein distance and generating all the strings within that distance? (It would also take in a character set.) So if I pass in a string x and a distance d, it would give me all the strings within that edit distance, including d-1 and d-2....d-n; (n < d). Expected functionality: >>> getWithinDistance('apple',2,{'a','b',' '}
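
One way to approach the question above is a breadth-first expansion: apply every single-character edit, then repeat on the new strings up to d times. The following is a minimal sketch under that assumption, not an answer taken from the original thread; `get_within_distance` and `edits_one` are hypothetical names, and insertions and substitutions draw only from the supplied character set.

```python
def edits_one(word, alphabet):
    """Return strings reachable from `word` by one insertion, deletion, or substitution."""
    results = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        for ch in alphabet:
            results.add(head + ch + tail)          # insertion at position i
        if tail:
            results.add(head + tail[1:])           # deletion of word[i]
            for ch in alphabet:
                results.add(head + ch + tail[1:])  # substitution of word[i]
    return results


def get_within_distance(word, distance, alphabet):
    """Return every string whose edit distance from `word` is at most `distance`."""
    seen = {word}
    frontier = {word}
    for _ in range(distance):
        # Expand only the strings discovered in the previous round.
        frontier = {e for w in frontier for e in edits_one(w, alphabet)} - seen
        seen |= frontier
    return seen
```

For example, `get_within_distance('a', 1, {'a', 'b'})` returns the six strings `''`, `'a'`, `'b'`, `'aa'`, `'ab'`, and `'ba'`. The result set grows very quickly with the distance and the alphabet size, so this only suits small inputs.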

Using Levenshtein function on each element in a tsvector?

Question: I'm trying to create a fuzzy search using Postgres and have been using django-watson as a base search engine to work from. I have a field called search_tsv that is a tsvector containing all the field values of the model that I want to search on. I wanted to use the Levenshtein function, which does exactly what I want on a text field. However, I don't really know how to run it on each individual element of the tsvector. Is there a way to do this? Answer 1: I would consider using the

OCR: weighted Levenshtein distance

Question: I'm trying to create an optical character recognition system with a dictionary. In fact, I don't have an implemented dictionary yet =) I've heard that there are simple metrics based on Levenshtein distance which take into account different distances between different symbols. E.g. 'N' and 'H' are very close to each other, and d("THEATRE", "TNEATRE") should be less than d("THEATRE", "TOEATRE"), which is impossible using basic Levenshtein distance. Could you help me locate such a metric, please? Answer 1:
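
The idea in the question above, where visually similar symbols cost less to substitute, can be sketched by giving the usual dynamic-programming recurrence a substitution-cost table. This is an illustrative sketch only: `weighted_levenshtein`, the `CONFUSABLE` table, and the 0.3 weight are assumptions made for the example, not values from any OCR library.

```python
def weighted_levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Levenshtein distance with per-pair substitution costs.

    `sub_cost` maps (char, char) pairs to a cost; pairs that are not
    listed fall back to a cost of 1.0, in either order.
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                cost = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel_cost,      # deletion
                            curr[j - 1] + indel_cost,  # insertion
                            prev[j - 1] + cost))       # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical confusion table: 'H' and 'N' are treated as easy to mix up.
CONFUSABLE = {("H", "N"): 0.3}
```

With this table, `weighted_levenshtein("THEATRE", "TNEATRE", CONFUSABLE)` comes out smaller than `weighted_levenshtein("THEATRE", "TOEATRE", CONFUSABLE)`, which is the ordering the question asks for. In practice the weights would have to be estimated from real recognition errors.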

Modifying Levenshtein Distance algorithm to not calculate all distances

Question: I'm working on a fuzzy search implementation, and as part of it we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're aiming for a specific maximum average response time for our fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more. Now, I know there are some limitations to what
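
When only matches within a known tolerance matter, one common way to avoid computing every distance in full is to abandon the calculation as soon as the threshold cannot be met. The sketch below assumes that reading of the question; `bounded_levenshtein` is a hypothetical helper written in Python for illustration and does not describe how the Apache method is implemented.

```python
def bounded_levenshtein(a, b, max_dist):
    """Return the Levenshtein distance if it is <= max_dist, otherwise None."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        # Row minima never decrease, so the final value cannot drop back
        # under the threshold once every cell in a row exceeds it.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None
```

The early exits pay off most when the majority of candidates are far from the query, which is the typical case when scanning a large word list.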

Fast fuzzy/approximate search in dictionary of strings in Ruby

Question: I have a dictionary of 50K to 100K strings (each can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary with some "edit" distance tolerance (Levenshtein, for example). I am fine with pre-computing any type of data structure before doing the search. My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would be fine just getting a boolean that says whether a given string is in the dictionary or not if there
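
Since the question allows any precomputed structure, one candidate is a BK-tree, which indexes words by their distance to a parent word and uses the triangle inequality to skip subtrees during a search. The sketch below is in Python rather than Ruby and is only an illustration; `BKTree` and `levenshtein` are hypothetical names, not part of any gem or library.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is a (word, children) pair; children are keyed by distance to the word."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, tolerance):
        """Return sorted (distance, word) pairs with distance <= tolerance."""
        matches = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= tolerance:
                matches.append((d, word))
            # Triangle inequality: only children keyed within
            # [d - tolerance, d + tolerance] can contain matches.
            for key, child in children.items():
                if d - tolerance <= key <= d + tolerance:
                    stack.append(child)
        return sorted(matches)
```

Because the results are sorted by distance, the first entry is a closest neighbor, and an empty list answers the boolean form of the question. How much of the tree gets pruned depends on the data and the tolerance, so it is worth measuring on the real dictionary.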

Levenshtein DFA in .NET

Question: Good afternoon. Does anyone know of an "out-of-the-box" implementation of a Levenshtein DFA (deterministic finite automaton) in .NET (or one easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an initial word w, find all known words at Levenshtein distance at most 2 from w in an efficient way. Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these
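
A simpler alternative to a full Levenshtein DFA, sketched here in Python rather than .NET, is to store the dictionary in a trie and carry one row of the edit-distance table down each branch, pruning a branch once every cell in its row exceeds the limit. This is not a DFA construction, and `build_trie` and `search_trie` are hypothetical names used only for the illustration.

```python
def build_trie(words):
    """Nested dicts keyed by character; the empty-string key stores a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word
    return root


def search_trie(trie, query, max_dist):
    """Return sorted (word, distance) pairs with distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the DP table by one row for the trie character `ch`.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (query[j - 1] != ch)))
        if "" in node and row[-1] <= max_dist:
            results.append((node[""], row[-1]))
        # Descend only if some prefix alignment is still within the limit.
        if min(row) <= max_dist:
            for next_ch, child in node.items():
                if next_ch:
                    walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch:
            walk(child, ch, first_row)
    return sorted(results)
```

Words that share a prefix share the rows computed for that prefix, which is where the saving over comparing the query against each word separately comes from. The sketch ignores an empty string stored in the dictionary and uses recursion, so very long words would need an iterative version.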

PHP - Finding number of matching words between two pieces of text?

Question: I want to find the number of matching words between two texts. Example: $str1=the cat is on the roof $str2=the mouse is on the roof. The words the, is, on, the, roof appear in both $str1 and $str2, so the output should be the number 5, or as a percentage, 86%. I tried the similar_text() function, but it does not work the way I want. Answer 1: Easy, explode them and then use array_diff: $array_1 = explode(" ", $str1); $array_2 = explode(" ", $str2); $totalWords = count($array_1); $differenceCount = count(array_diff($array_1
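
The count the question asks for, where a repeated word such as "the" is counted once per matching occurrence, can be sketched as a multiset intersection. The example below is in Python rather than PHP and is an illustration only; `count_matching_words` is a hypothetical helper, and splitting on whitespace is an assumption about how words are delimited.

```python
from collections import Counter


def count_matching_words(text_a, text_b):
    """Count words common to both texts, respecting repeated occurrences."""
    # `&` on Counters keeps each word with the smaller of its two counts.
    common = Counter(text_a.split()) & Counter(text_b.split())
    return sum(common.values())


def matching_ratio(text_a, text_b):
    """Share of the words in `text_a` that also occur in `text_b`."""
    total = len(text_a.split())
    return count_matching_words(text_a, text_b) / total if total else 0.0
```

For the two sentences in the question, `count_matching_words("the cat is on the roof", "the mouse is on the roof")` gives 5.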

Approximate String Matching in R

Question: For my research I have to match two data sets containing fund information. Unfortunately, there is no common identifier. The good thing is that both data sets have an identifier for the document number; a document, however, can contain multiple funds. If there are multiple funds in the document (e.g. 20), I can only match via the fund's name, which can sometimes differ slightly. Note that the number of funds per document is identical in both data sets. After searching a little bit I tried to use

mySQL - matching latin (english) form input to utf8 (non-English) data

Question: I maintain a music database in MySQL. How do I return results stored under e.g. 'Tiësto' when people search for 'Tiesto'? All the data is stored under full-text indexing, if that makes any difference. I'm already employing a combination of Levenshtein in PHP and REGEXP in SQL, not to solve this problem, but for increased searchability in general. PHP: function Levenshtein($word) { $words = array(); for ($i = 0; $i < strlen($word); $i++) { $words[] = substr($word, 0, $i) . '_'
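
One possible angle on the 'Tiësto' versus 'Tiesto' case, separate from the Levenshtein code in the question, is to fold accented characters to their base letters on both the stored value and the search term before comparing. The sketch below shows that folding in Python with the standard `unicodedata` module; `strip_accents` is a hypothetical helper, and whether to fold in the application or through the database's collation is left open.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that 'ë' compares equal to 'e'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this helper, `strip_accents("Tiësto")` yields `"Tiesto"`, so a folded copy of each name could be stored and searched alongside the original. Letters without a decomposition, such as 'ø', pass through unchanged and would need a separate mapping.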

Damerau-Levenshtein distance Implementation

Question: I'm trying to create a Damerau-Levenshtein distance function in JS. I've found a description of the algorithm on Wikipedia, but there is no implementation of it. It says: To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose
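
A sketch of the unrestricted Damerau-Levenshtein distance described in the quoted passage follows, written in Python rather than JS. It is one reading of the commonly published dynamic-programming formulation, not code from the original thread; the names `damerau_levenshtein`, `last_row`, and `last_col` are chosen for the illustration.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance (adjacent transpositions allowed)."""
    max_dist = len(a) + len(b)
    last_row = {}  # for each character, the last 1-based row of `a` where it occurred
    # The table has one extra border row and column filled with max_dist,
    # so d[i + 1][j + 1] holds the distance between a[:i] and b[:j].
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j

    for i in range(1, len(a) + 1):
        last_col = 0  # last 1-based column in this row where the characters matched
        for j in range(1, len(b) + 1):
            row = last_row.get(b[j - 1], 0)
            col = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution or match
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # transposition, plus the edits between the two swapped characters
                d[row][col] + (i - row - 1) + 1 + (j - col - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]
```

For "ca" and "abc" this returns 2, where the restricted variant (optimal string alignment) would return 3, since the unrestricted form may insert a character between two letters after transposing them.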