levenshtein-distance

MySQL: Using Levenshtein distance to find duplicates in 20,000 rows

大憨熊 submitted on 2021-01-29 04:01:10
Question: I have a two-column table containing a primary key and company names, with about 20,000 rows. My task is to find all duplicate entries. I originally tried soundex, but it matched companies that were completely different just because they shared similar first words. That led me to the Levenshtein distance algorithm. The problem is that the query takes an indefinite amount of time: I have left it running for about 10 hours now and it still has not returned. Here is the query: SELECT
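
The usual fix here is not a faster SQL UDF but avoiding the full 20,000 × 20,000 comparison in the first place. Below is a minimal Python sketch of that idea, run outside MySQL; the table/column names, the first-4-characters blocking key, and the 0.85 cutoff are all assumptions to tune, not from the question.

from collections import defaultdict
from difflib import SequenceMatcher

# rows stands in for the result of a query such as
# "SELECT id, name FROM companies" (hypothetical table/column names)
rows = [(1, "Acme Corp"), (2, "Acme Corp."), (3, "Globex Inc")]

# blocking: only names sharing a cheap key are ever compared,
# which cuts the all-pairs cost dramatically
blocks = defaultdict(list)
for pk, name in rows:
    blocks[name.lower()[:4]].append((pk, name))   # assumed blocking key

duplicates = []
for group in blocks.values():
    for i, (id1, n1) in enumerate(group):
        for id2, n2 in group[i + 1:]:
            # SequenceMatcher.ratio() is a stdlib similarity in [0, 1];
            # 0.85 is an assumed threshold
            if SequenceMatcher(None, n1.lower(), n2.lower()).ratio() >= 0.85:
                duplicates.append((id1, id2))

print(duplicates)   # [(1, 2)]

The trade-off is that a blocking key this crude misses duplicates whose first characters differ; in practice one often blocks on several keys and unions the candidate pairs.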

Smartest way to double loop over a data frame (comparing rows to each other with a Levenshtein Dist) in R?

╄→尐↘猪︶ㄣ submitted on 2021-01-27 17:10:12
Question: I cooked up a data frame of param strings over several records:

  idName                    Str
1 Аэрофлот_Эконом           95111000210102121111010100111000100110101001
2 Аэрофлот_Комфорт          95111000210102121111010100111000100110101001
3 Аэрофлот_Бизнес           96111000210102121111010100111000100110101001
4 Трансаэро_Дисконт         26111000210102120000010100001010000010001000
5 Трансаэро_Туристический   26111000210002120000010100001010000010001000
6 Трансаэро_Эконом          26111000210002120000010100001010000010001000

Now I need to compare each one against
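
In R itself, the usual answer is that base utils::adist(df$Str) returns the whole pairwise Levenshtein matrix in one vectorized call, with no explicit double loop. Purely as an illustration of the same all-pairs shape (each unordered pair computed exactly once), here is a Python sketch; the dict below is a hypothetical stand-in for the idName/Str columns.

from itertools import combinations

def lev(s, t):
    # plain dynamic-programming Levenshtein, kept to one row at a time
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# hypothetical stand-in for the data frame
records = {
    "Аэрофлот_Эконом":  "95111000210102121111010100111000100110101001",
    "Аэрофлот_Комфорт": "95111000210102121111010100111000100110101001",
    "Аэрофлот_Бизнес":  "96111000210102121111010100111000100110101001",
}

# upper triangle only: combinations() visits each unordered pair once,
# halving the work of a naive double loop
for (n1, s1), (n2, s2) in combinations(records.items(), 2):
    print(n1, n2, lev(s1, s2))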

Difference in normalization of Levenshtein (edit) distance?

随声附和 submitted on 2021-01-27 05:37:15
Question: If the Levenshtein distance between two strings s and t is given by L(s,t), what is the difference in the impact on the resulting heuristic of the following normalization schemes?

1. L(s,t) / [length(s) + length(t)]
2. L(s,t) / max[length(s), length(t)]
3. (L(s,t) * 2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page, but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there
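
A worked example makes the schemes concrete. For s = "kitten", t = "sitting" we have L(s,t) = 3, length(s) = 6, length(t) = 7:

scheme 1:  3 / (6 + 7)     = 3/13 ≈ 0.23
scheme 2:  3 / max(6, 7)   = 3/7  ≈ 0.43
scheme 3:  2 * 3 / (6 + 7) = 6/13 ≈ 0.46

Scheme 3 is exactly twice scheme 1, so those two rank every pair of strings identically; only scheme 2 is genuinely different. Since L(s,t) ≤ max(length(s), length(t)) always holds, scheme 2 stays in [0, 1] and reaches 1 for completely dissimilar strings, whereas scheme 1 tops out at 0.5 for equal-length strings, and scheme 3 can exceed 1 when lengths differ (s = "", t = "abc" gives 2·3/3 = 2).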

bitparallel weighted Levenshtein distance

僤鯓⒐⒋嵵緔 submitted on 2021-01-07 01:32:07
Question: I am using a weighted Levenshtein distance with the following costs: insertion 1, deletion 1, replacement 2. As pointed out by wildwasser in a comment, this means that a substitution is treated as an insertion plus a deletion, so substitutions could be avoided by the algorithm entirely. For the normal implementation with a cost of 1 for each operation there are multiple bit-parallel implementations, e.g. Myers/Hyyrö:

static const uint64_t masks[64] = {
    0x0000000000000001, 0x0000000000000003,
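
With these costs a substitution (2) never beats a deletion plus an insertion (1 + 1), so the metric reduces to the indel/LCS distance: dist(a, b) = len(a) + len(b) − 2·LCS(a, b). Any bit-parallel LCS-length algorithm therefore yields the weighted distance directly; RapidFuzz's rapidfuzz.distance.Indel module implements this same metric, if a library is acceptable. A plain, non-bit-parallel reference sketch of the identity:

def weighted_lev(a, b):
    # LCS length by dynamic programming, one row at a time
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    lcs = prev[-1]
    # every character outside the LCS is deleted (from a) or inserted
    # (into b) at cost 1 each; substitutions are never strictly cheaper
    return len(a) + len(b) - 2 * lcs

print(weighted_lev("kitten", "sitting"))   # 5: two substitutions (2+2) + one insertion (1)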

Better fuzzy matching performance?

风流意气都作罢 submitted on 2020-07-05 04:39:06
Question: I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings, matching each one against another list of approximately 15,000 strings:

a = ['blah', 'pie', 'apple'...]
b = ['jimbo', 'zomg', 'pie'...]

for value in a:
    difflib.get_close_matches(value, b, n=1, cutoff=.85)

It takes 0.58 seconds per value, which means the loop will take 8,714 seconds, or about 145 minutes, to finish. Is there another library/method that might be faster, or a way to improve the speed for
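
One common answer is to swap difflib for RapidFuzz, whose scorers are implemented in C++ and are typically orders of magnitude faster for exactly this loop. A sketch under that assumption, reusing the question's a and b (note RapidFuzz scores run 0–100, so difflib's 0.85 cutoff becomes 85):

from rapidfuzz import fuzz, process

a = ['blah', 'pie', 'apple']
b = ['jimbo', 'zomg', 'pie']

for value in a:
    # returns (match, score, index), or None if nothing reaches score_cutoff
    best = process.extractOne(value, b, scorer=fuzz.ratio, score_cutoff=85)
    print(value, best)

The score_cutoff argument also helps internally: candidates are abandoned as soon as they can no longer reach the cutoff.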

Fuzzy matching a string in SQL

大憨熊 submitted on 2020-04-11 04:19:28
Question: I have a User table with the columns id, first_name, last_name, street_address, city, state, zip-code, firm, user_identifier, created_at, update_at. This table has a lot of duplication: the same users have been entered multiple times as new users. For example:

id  first_name  last_name  street_address  user_identifier
11  Mary        Doe        123 Main Ave    M2111111
21  Mary        Doe        123 Main Ave
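
Before reaching for Levenshtein inside SQL, it is often enough to normalize the identifying columns and group on the normalized key, since most "re-entered user" duplicates differ only in case, punctuation, or whitespace. A minimal Python sketch of that idea; the field subset and the normalization rule are assumptions, not from the question:

from collections import defaultdict

users = [
    (11, "Mary", "Doe", "123 Main Ave"),
    (21, "Mary", "Doe", "123 Main Ave."),
]

def norm(s):
    # lowercase and keep only letters/digits, so "Ave" == "Ave."
    return "".join(ch for ch in s.lower() if ch.isalnum())

groups = defaultdict(list)
for uid, first, last, street in users:
    groups[(norm(first), norm(last), norm(street))].append(uid)

print([ids for ids in groups.values() if len(ids) > 1])   # [[11, 21]]

Rows that survive exact grouping can then be handled with a real edit-distance pass within blocks, as in the MySQL question above.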

Fuzzy matching a string in pyspark or SQL using Soundex function or Levenshtein distance

ぐ巨炮叔叔 submitted on 2020-03-23 12:03:25
Question: I have to apply the Levenshtein function on the last column when passport and country are the same.

matrix = passport_heck.select(
        f.col('name_id').alias('name_id_1'),
        f.col('last').alias('last_1'),
        f.col('country').alias('country_1'),
        f.col('passport').alias('passport_1')) \
    .crossJoin(passport_heck.select(
        f.col('name_id').alias('name_id_2'),
        f.col('last').alias('last_2'),
        f.col('country').alias('country_2'),
        f.col('passport').alias('passport_2'))) \
    .filter((f.col('passport_1') == f.col('passport_2'))
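
Spark already ships an edit-distance column function, pyspark.sql.functions.levenshtein, so pair generation and the distance can stay inside one query. A sketch of that route, assuming the question's passport_heck DataFrame is in scope; it swaps the raw crossJoin for a self-join on the equality keys, which is not what the question's code does but avoids building pairs that the filter would discard anyway:

from pyspark.sql import functions as f

pairs = (
    passport_heck.alias('a')
    .join(
        passport_heck.alias('b'),
        (f.col('a.passport') == f.col('b.passport'))
        & (f.col('a.country') == f.col('b.country'))
        & (f.col('a.name_id') < f.col('b.name_id')),  # each unordered pair once
    )
    # built-in Levenshtein on the last-name columns
    .withColumn('last_dist', f.levenshtein(f.col('a.last'), f.col('b.last')))
)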

Levenshtein distance with bound/limit

╄→гoц情女王★ submitted on 2020-03-14 19:06:09
Question: I have found some Python implementations of the Levenshtein distance. I am wondering, though, how these algorithms can be efficiently modified so that they break off early if the Levenshtein distance is greater than n (e.g. 3), instead of running to the end. Essentially, I do not want to let the algorithm run for too long to calculate the final distance if I simply want to know whether the distance is greater than a threshold or not. I have found some relevant posts here: Modifying Levenshtein Distance
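
A minimal sketch of the early-exit idea: the minimum value in a DP row is a lower bound on the final distance, so once every entry in a row exceeds the threshold the computation can stop. The function name and the k + 1 sentinel are my own conventions, not from the linked posts.

def levenshtein_limited(s, t, k):
    # if the lengths alone differ by more than k, the distance must too
    if abs(len(s) - len(t)) > k:
        return k + 1
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        # the row minimum never decreases in later rows, so once it
        # passes k the final distance is guaranteed to exceed k
        if min(curr) > k:
            return k + 1          # sentinel meaning "greater than k"
        prev = curr
    return prev[-1]

print(levenshtein_limited("kitten", "sitting", 1))   # 2, i.e. distance > 1
print(levenshtein_limited("kitten", "sitting", 5))   # 3, the exact distance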