levenshtein-distance

MySQL: Using Levenshtein distance to find duplicates in 20,000 rows

大憨熊 submitted on 2021-01-29 04:01:10
Question: I have a two-column table containing a primary key and company names, with about 20,000 rows. My task is to find all duplicate entries. I originally tried soundex, but it matched companies that were completely different just because they shared similar first words. That led me to the Levenshtein distance algorithm. The problem is that the query takes an indefinite amount of time: I have left it running for about 10 hours now and it still has not returned. Here is the query: SELECT
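
The usual fix here is not a faster SQL UDF but avoiding the full 20,000 × 20,000 comparison in the first place. Below is a minimal Python sketch of that idea, run outside MySQL; the table/column names, the first-4-characters blocking key, and the 0.85 cutoff are all assumptions to tune, not from the question.

from collections import defaultdict
from difflib import SequenceMatcher

# rows stands in for the result of a query such as
# "SELECT id, name FROM companies" (hypothetical table/column names)
rows = [(1, "Acme Corp"), (2, "Acme Corp."), (3, "Globex Inc")]

# blocking: only names sharing a cheap key are ever compared,
# which cuts the all-pairs cost dramatically
blocks = defaultdict(list)
for pk, name in rows:
    blocks[name.lower()[:4]].append((pk, name))   # assumed blocking key

duplicates = []
for group in blocks.values():
    for i, (id1, n1) in enumerate(group):
        for id2, n2 in group[i + 1:]:
            # SequenceMatcher.ratio() is a stdlib similarity in [0, 1];
            # 0.85 is an assumed threshold
            if SequenceMatcher(None, n1.lower(), n2.lower()).ratio() >= 0.85:
                duplicates.append((id1, id2))

print(duplicates)   # [(1, 2)]

The trade-off is that a blocking key this crude misses duplicates whose first characters differ; in practice one often blocks on several keys and unions the candidate pairs.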

Smartest way to double loop over a data frame (comparing rows to each other with a Levenshtein Dist) in R?

╄→尐↘猪︶ㄣ submitted on 2021-01-27 17:10:12
Question: I cooked up a data frame of param strings over several records:

  idName                    Str
1 Аэрофлот_Эконом           95111000210102121111010100111000100110101001
2 Аэрофлот_Комфорт          95111000210102121111010100111000100110101001
3 Аэрофлот_Бизнес           96111000210102121111010100111000100110101001
4 Трансаэро_Дисконт         26111000210102120000010100001010000010001000
5 Трансаэро_Туристический   26111000210002120000010100001010000010001000
6 Трансаэро_Эконом          26111000210002120000010100001010000010001000

Now I need to compare each one against
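
In R itself, the usual answer is that base utils::adist(df$Str) returns the whole pairwise Levenshtein matrix in one vectorized call, with no explicit double loop. Purely as an illustration of the same all-pairs shape (each unordered pair computed exactly once), here is a Python sketch; the dict below is a hypothetical stand-in for the idName/Str columns.

from itertools import combinations

def lev(s, t):
    # plain dynamic-programming Levenshtein, kept to one row at a time
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# hypothetical stand-in for the data frame
records = {
    "Аэрофлот_Эконом":  "95111000210102121111010100111000100110101001",
    "Аэрофлот_Комфорт": "95111000210102121111010100111000100110101001",
    "Аэрофлот_Бизнес":  "96111000210102121111010100111000100110101001",
}

# upper triangle only: combinations() visits each unordered pair once,
# halving the work of a naive double loop
for (n1, s1), (n2, s2) in combinations(records.items(), 2):
    print(n1, n2, lev(s1, s2))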

Difference in normalization of Levenshtein (edit) distance?

随声附和 submitted on 2021-01-27 05:37:15
Question: If the Levenshtein distance between two strings s and t is given by L(s,t), what is the difference in the impact on the resulting heuristic of the following normalization schemes?

1. L(s,t) / [length(s) + length(t)]
2. L(s,t) / max[length(s), length(t)]
3. (L(s,t) * 2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page, but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there
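
A worked example makes the schemes concrete. For s = "kitten", t = "sitting" we have L(s,t) = 3, length(s) = 6, length(t) = 7:

scheme 1:  3 / (6 + 7)     = 3/13 ≈ 0.23
scheme 2:  3 / max(6, 7)   = 3/7  ≈ 0.43
scheme 3:  2 * 3 / (6 + 7) = 6/13 ≈ 0.46

Scheme 3 is exactly twice scheme 1, so those two rank every pair of strings identically; only scheme 2 is genuinely different. Since L(s,t) ≤ max(length(s), length(t)) always holds, scheme 2 stays in [0, 1] and reaches 1 for completely dissimilar strings, whereas scheme 1 tops out at 0.5 for equal-length strings, and scheme 3 can exceed 1 when lengths differ (s = "", t = "abc" gives 2·3/3 = 2).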

bitparallel weighted Levenshtein distance

僤鯓⒐⒋嵵緔 submitted on 2021-01-07 01:32:07
Question: I am using a weighted Levenshtein distance with the following costs: insertion 1, deletion 1, replacement 2. As pointed out by wildwasser in a comment, this means that a substitution is treated as an insertion plus a deletion, so substitutions could be avoided by the algorithm entirely. For the normal implementation with a cost of 1 for each operation there are multiple bit-parallel implementations, e.g. Myers/Hyyrö:

static const uint64_t masks[64] = {
    0x0000000000000001, 0x0000000000000003,
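
With these costs a substitution (2) never beats a deletion plus an insertion (1 + 1), so the metric reduces to the indel/LCS distance: dist(a, b) = len(a) + len(b) − 2·LCS(a, b). Any bit-parallel LCS-length algorithm therefore yields the weighted distance directly; RapidFuzz's rapidfuzz.distance.Indel module implements this same metric, if a library is acceptable. A plain, non-bit-parallel reference sketch of the identity:

def weighted_lev(a, b):
    # LCS length by dynamic programming, one row at a time
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    lcs = prev[-1]
    # every character outside the LCS is deleted (from a) or inserted
    # (into b) at cost 1 each; substitutions are never strictly cheaper
    return len(a) + len(b) - 2 * lcs

print(weighted_lev("kitten", "sitting"))   # 5: two substitutions (2+2) + one insertion (1)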

Better fuzzy matching performance?

风流意气都作罢 submitted on 2020-07-05 04:39:06
Question: I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings, matching each one against another list of approximately 15,000 strings:

a = ['blah', 'pie', 'apple'...]
b = ['jimbo', 'zomg', 'pie'...]

for value in a:
    difflib.get_close_matches(value, b, n=1, cutoff=.85)

It takes 0.58 seconds per value, which means the loop will take 8,714 seconds, or about 145 minutes, to finish. Is there another library/method that might be faster, or a way to improve the speed for
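
One common answer is to swap difflib for RapidFuzz, whose scorers are implemented in C++ and are typically orders of magnitude faster for exactly this loop. A sketch under that assumption, reusing the question's a and b (note RapidFuzz scores run 0–100, so difflib's 0.85 cutoff becomes 85):

from rapidfuzz import fuzz, process

a = ['blah', 'pie', 'apple']
b = ['jimbo', 'zomg', 'pie']

for value in a:
    # returns (match, score, index), or None if nothing reaches score_cutoff
    best = process.extractOne(value, b, scorer=fuzz.ratio, score_cutoff=85)
    print(value, best)

The score_cutoff argument also helps internally: candidates are abandoned as soon as they can no longer reach the cutoff.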

Fuzzy matching a string in SQL

大憨熊 submitted on 2020-04-11 04:19:28
Question: I have a User table with the columns id, first_name, last_name, street_address, city, state, zip-code, firm, user_identifier, created_at, update_at. This table has a lot of duplication: the same users have been entered multiple times as new users. For example:

id  first_name  last_name  street_address  user_identifier
11  Mary        Doe        123 Main Ave    M2111111
21  Mary        Doe        123 Main Ave
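
Before reaching for Levenshtein inside SQL, it is often enough to normalize the identifying columns and group on the normalized key, since most "re-entered user" duplicates differ only in case, punctuation, or whitespace. A minimal Python sketch of that idea; the field subset and the normalization rule are assumptions, not from the question:

from collections import defaultdict

users = [
    (11, "Mary", "Doe", "123 Main Ave"),
    (21, "Mary", "Doe", "123 Main Ave."),
]

def norm(s):
    # lowercase and keep only letters/digits, so "Ave" == "Ave."
    return "".join(ch for ch in s.lower() if ch.isalnum())

groups = defaultdict(list)
for uid, first, last, street in users:
    groups[(norm(first), norm(last), norm(street))].append(uid)

print([ids for ids in groups.values() if len(ids) > 1])   # [[11, 21]]

Rows that survive exact grouping can then be handled with a real edit-distance pass within blocks, as in the MySQL question above.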

Fuzzy matching a string in pyspark or SQL using Soundex function or Levenshtein distance

ぐ巨炮叔叔 submitted on 2020-03-23 12:03:25
Question: I have to apply the Levenshtein function on the last column when passport and country are the same.

matrix = passport_heck.select(
        f.col('name_id').alias('name_id_1'),
        f.col('last').alias('last_1'),
        f.col('country').alias('country_1'),
        f.col('passport').alias('passport_1')) \
    .crossJoin(passport_heck.select(
        f.col('name_id').alias('name_id_2'),
        f.col('last').alias('last_2'),
        f.col('country').alias('country_2'),
        f.col('passport').alias('passport_2'))) \
    .filter((f.col('passport_1') == f.col('passport_2'))
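
Spark already ships an edit-distance column function, pyspark.sql.functions.levenshtein, so pair generation and the distance can stay inside one query. A sketch of that route, assuming the question's passport_heck DataFrame is in scope; it swaps the raw crossJoin for a self-join on the equality keys, which is not what the question's code does but avoids building pairs that the filter would discard anyway:

from pyspark.sql import functions as f

pairs = (
    passport_heck.alias('a')
    .join(
        passport_heck.alias('b'),
        (f.col('a.passport') == f.col('b.passport'))
        & (f.col('a.country') == f.col('b.country'))
        & (f.col('a.name_id') < f.col('b.name_id')),  # each unordered pair once
    )
    # built-in Levenshtein on the last-name columns
    .withColumn('last_dist', f.levenshtein(f.col('a.last'), f.col('b.last')))
)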

Levenshtein distance with bound/limit

╄→гoц情女王★ submitted on 2020-03-14 19:06:09
Question: I have found some Python implementations of the Levenshtein distance. I am wondering, though, how these algorithms can be efficiently modified so that they break off early if the Levenshtein distance is greater than n (e.g. 3), instead of running to the end. Essentially, I do not want to let the algorithm run for too long to calculate the final distance if I simply want to know whether the distance is greater than a threshold or not. I have found some relevant posts here: Modifying Levenshtein Distance
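
A minimal sketch of the early-exit idea: the minimum value in a DP row is a lower bound on the final distance, so once every entry in a row exceeds the threshold the computation can stop. The function name and the k + 1 sentinel are my own conventions, not from the linked posts.

def levenshtein_limited(s, t, k):
    # if the lengths alone differ by more than k, the distance must too
    if abs(len(s) - len(t)) > k:
        return k + 1
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        # the row minimum never decreases in later rows, so once it
        # passes k the final distance is guaranteed to exceed k
        if min(curr) > k:
            return k + 1          # sentinel meaning "greater than k"
        prev = curr
    return prev[-1]

print(levenshtein_limited("kitten", "sitting", 1))   # 2, i.e. distance > 1
print(levenshtein_limited("kitten", "sitting", 5))   # 3, the exact distance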