levenshtein-distance

How to compare almost similar Strings in Java? (String distance measure) [closed]

点点圈 submitted on 2019-11-26 22:33:49
Question: I would like to compare two strings and get a score for how alike they are. For example, "The sentence is almost similar" and "The sentence is similar". I'm not familiar with existing methods in Java, but for PHP I know the levenshtein function. Are there better methods in Java?
Answer 1: The Levenshtein ...

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

冷暖自知 submitted on 2019-11-26 21:18:09
I am doing clinical message normalization (spell check), in which I check each given word against a 900,000-word medical dictionary. I am more concerned about time complexity and performance. I want to do fuzzy string comparison, but I'm not sure which library to use.
Option 1:
    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    Result: 0.625
Option 2:
    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    Result: 0.625
In this example both give the same answer. Do you think both perform alike in this case?
In case you're interested in a quick visual comparison of ...
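
As a rough illustration of Option 2 applied to a dictionary lookup, here is a minimal sketch that uses only the standard library. The helper names `similarity` and `closest_words`, the three-word dictionary, and the 0.8 cutoff are assumptions made for this example; it says nothing about how either library performs on 900,000 words.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return difflib's ratio: 2 * matches / (len(a) + len(b))."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest_words(word, dictionary, n=3, cutoff=0.8):
    """Return up to n dictionary words whose ratio to `word` is >= cutoff."""
    return difflib.get_close_matches(word, dictionary, n=n, cutoff=cutoff)


if __name__ == "__main__":
    # Hypothetical miniature dictionary, for illustration only.
    words = ["diabetes", "dialysis", "hepatitis"]
    print(similarity("hello world", "hello"))
    print(closest_words("diabetis", words))
```

`get_close_matches` still compares the query against every candidate, so for a very large dictionary it would need to be measured before being relied on.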

Similarity Score - Levenshtein

夙愿已清 submitted on 2019-11-26 20:25:52
Question: I implemented the Levenshtein algorithm in Java and now get the number of corrections made by the algorithm, a.k.a. the cost. This helps a little, but not much, since I want the result as a percentage. So I want to know how to calculate a similarity score from it. I would also like to know how other people do it, and why.
Answer 1: The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being ...
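
One common way to turn the edit count into a percentage is to divide it by the length of the longer string. The sketch below assumes that normalization and unit costs for insertion, deletion, and substitution; the question is about Java, and Python is used here only to keep the example short. The names `levenshtein` and `similarity_percent` are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return (1 - levenshtein(a, b) / longest) * 100


if __name__ == "__main__":
    print(similarity_percent("The sentence is almost similar",
                             "The sentence is similar"))
```

For the two example sentences the distance is 7 (the seven characters of "almost "), which gives a similarity of roughly 76.7%. Other normalizations exist, for example dividing by the sum of both lengths, so the percentage depends on the convention chosen.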

Efficient string similarity grouping

半腔热情 submitted on 2019-11-26 17:48:14
Question: Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names).
    pdata <- data.frame(parents_name = c("peter pan + marta steward",
                                         "pieter pan + marta steward",
                                         "armin dolgner + jane johanna dough",
                                         "jack jackson + sombody else"))
The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth observations are each in a separate family. E.g.:
    person_id parents_name family_id ...
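
A simple way to get such a family_id column is a greedy pass that compares each name against one representative per family found so far. The sketch below is an assumption-laden illustration, not the approach from the answers: it is written in Python rather than the R of the question, uses `difflib` ratios instead of a Levenshtein distance, and the function name `assign_family_ids` and the 0.9 threshold are made up for the example.

```python
import difflib


def assign_family_ids(names, threshold=0.9):
    """Greedy grouping: a name joins the first family whose representative
    (the first name seen for that family) is similar enough, else starts a new one."""
    representatives = []
    family_ids = []
    for name in names:
        for family_id, representative in enumerate(representatives):
            ratio = difflib.SequenceMatcher(None, name, representative).ratio()
            if ratio >= threshold:
                family_ids.append(family_id)
                break
        else:
            # No existing family matched, so this name founds a new family.
            representatives.append(name)
            family_ids.append(len(representatives) - 1)
    return family_ids


if __name__ == "__main__":
    parents = ["peter pan + marta steward",
               "pieter pan + marta steward",
               "armin dolgner + jane johanna dough",
               "jack jackson + sombody else"]
    print(assign_family_ids(parents))
```

The pass is quadratic in the worst case and its result depends on input order, so it is only a starting point for larger data sets.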

String similarity metrics in Python

China☆狼群 submitted on 2019-11-26 08:47:14
Question: I want to find the string similarity between two strings. This page has examples of some of them. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints? I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True. False negatives are acceptable; false positives, except in extremely rare cases, are not. This is done in a non-realtime setting, so ...

Finding closest neighbour using optimized Levenshtein Algorithm

本小妞迷上赌 submitted on 2019-11-26 07:47:06
Question: I recently posted a question about optimizing the algorithm to compute the Levenshtein distance, and the replies led me to the Wikipedia article on Levenshtein distance. The article mentions that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn) to O(kn), m and n being the lengths of the strings. I looked up the algorithm, but I couldn't really figure out how to implement it. I was hoping to get ...
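
The bound works because any alignment with at most k edits can never stray more than k cells from the main diagonal of the dynamic-programming table, so only that band has to be filled in. The following is a minimal sketch of this idea under stated assumptions: the name `bounded_levenshtein` and the convention of returning `None` when the distance exceeds k are choices made for this example, and each row is allocated in full for readability, so only the inner loop (not the allocation) is limited to about 2k + 1 cells.

```python
def bounded_levenshtein(a: str, b: str, k: int):
    """Return the Levenshtein distance if it is <= k, otherwise None (k >= 0)."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already needs more than k edits
    too_far = k + 1  # stands in for "more than k"; exact value no longer matters
    previous = [j if j <= k else too_far for j in range(m + 1)]
    for i in range(1, n + 1):
        low = max(1, i - k)   # first column inside the band for this row
        high = min(m, i + k)  # last column inside the band for this row
        current = [too_far] * (m + 1)
        if i <= k:
            current[0] = i  # deleting the first i characters of a
        for j in range(low, high + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            current[j] = min(previous[j] + 1,         # deletion
                             current[j - 1] + 1,      # insertion
                             previous[j - 1] + cost,  # substitution or match
                             too_far)
        if min(current[low - 1:high + 1]) > k:
            return None  # every cell in the band exceeds k, so stop early
        previous = current
    return previous[m] if previous[m] <= k else None


if __name__ == "__main__":
    print(bounded_levenshtein("kitten", "sitting", 3))
    print(bounded_levenshtein("kitten", "sitting", 2))
```

When searching for a closest neighbour, k can be lowered to the best distance found so far, which makes the early exit trigger more often.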

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

可紊 submitted on 2019-11-26 06:55:28
Question: I am doing clinical message normalization (spell check), in which I check each given word against a 900,000-word medical dictionary. I am more concerned about time complexity and performance. I want to do fuzzy string comparison, but I'm not sure which library to use.
Option 1:
    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    Result: 0.625
Option 2:
    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    Result: 0.625
In this example both give the same ...

Levenshtein: MySQL + PHP

自闭症网瘾萝莉.ら submitted on 2019-11-26 03:51:09
    $word = strtolower($_GET['term']);
    $lev = 0;
    $q = mysql_query("SELECT `term` FROM `words`");
    while ($r = mysql_fetch_assoc($q)) {
        $r['term'] = strtolower($r['term']);
        $lev = levenshtein($word, $r['term']);
        if ($lev >= 0 && $lev < 5) {
            $word = $r['term'];
        }
    }
How can I move all of that into a single query? I don't want to fetch every term and do the filtering in PHP.
rik: You need a levenshtein function in MySQL and a query like
    $word = mysql_real_escape_string($word);
    mysql_query("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND 4");
There are two ways to ...

What algorithm gives suggestions in a spell checker?

此生再无相见时 submitted on 2019-11-26 02:15:49
Question: What algorithm is typically used when implementing a spell checker that is accompanied by word suggestions? At first I thought it might make sense to check each new word typed (if not found in the dictionary) against its Levenshtein distance from every other word in the dictionary and return the top results. However, this seems highly inefficient, since the entire dictionary has to be evaluated repeatedly. How is this typically done?
Answer 1: There is a good essay by Peter Norvig ...
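
One way to avoid scanning the whole dictionary is to go the other way round: generate every string within one edit of the typed word and keep only those that are real words. The sketch below illustrates that candidate-generation idea under stated assumptions: a lowercase ASCII alphabet, a dictionary held in a Python set, and no ranking of the suggestions (a real checker would typically order them, for example by word frequency). The names `edits1` and `suggest` are chosen for this example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, dictionary):
    """Return the word itself if known, else known words one edit away (may be empty)."""
    if word in dictionary:
        return {word}
    return edits1(word) & dictionary


if __name__ == "__main__":
    # Hypothetical miniature dictionary, for illustration only.
    known = {"spelling", "spewing", "sailing"}
    print(suggest("speling", known))
```

The number of candidates grows with the word length and alphabet size rather than with the dictionary size, and each candidate costs one set lookup. Because transpositions are counted as a single edit, this is slightly broader than plain Levenshtein distance 1.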

Levenshtein: MySQL + PHP

心已入冬 submitted on 2019-11-26 01:52:53
Question:
    $word = strtolower($_GET['term']);
    $lev = 0;
    $q = mysql_query("SELECT `term` FROM `words`");
    while ($r = mysql_fetch_assoc($q)) {
        $r['term'] = strtolower($r['term']);
        $lev = levenshtein($word, $r['term']);
        if ($lev >= 0 && $lev < 5) {
            $word = $r['term'];
        }
    }
How can I move all of that into a single query? I don't want to fetch every term and do the filtering in PHP.
Answer 1: You need a levenshtein function in MySQL and a query like
    $word = mysql_real_escape_string($word);
    mysql ...