fuzzy-comparison

Match slightly different records in a field

纵饮孤独 submitted on 2021-01-29 21:58:27
Question: I have the table HAVE below. How can I get the results shown in WANT? I'd appreciate any ideas, and I'm open to any fuzzy-match algorithm out there.

HAVE

ID  Name
1   Davi
2   David
3   DAVID
4   Micheal
5   Michael
6   Oracle
7   Tepper

WANT

ID  Name     mtch_ind
1   Davi     1
2   David    1
3   DAVID    1
4   Micheal  2
5   Michael  2
6   Oracle   3
7   Tepper   4

Table DDL and record inserts:

    CREATE TABLE HAVE ( ID INTEGER, Name VARCHAR(10) );
    INSERT INTO data VALUES ('1', 'Davi');
    INSERT INTO data VALUES ('2', 'David');
    INSERT INTO data
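One possible way to produce a mtch_ind-style column is a greedy grouping pass. This is a minimal sketch in Python rather than SQL, assuming a case-insensitive comparison with difflib's SequenceMatcher; the function name assign_match_groups and the 0.8 threshold are hypothetical choices for illustration, not something taken from the question.

```python
from difflib import SequenceMatcher


def assign_match_groups(names, threshold=0.8):
    """Return one group number per name; similar names share a number.

    Each name is compared (lowercased) against the first name seen in
    every existing group. The threshold is an assumed tuning knob.
    """
    representatives = []  # lowercased first member of each group
    groups = []
    for name in names:
        key = name.lower()
        for group_id, rep in enumerate(representatives, start=1):
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                groups.append(group_id)
                break
        else:
            # No existing group was close enough: start a new one.
            representatives.append(key)
            groups.append(len(representatives))
    return groups
```

With the seven names from HAVE in ID order, this sketch yields 1, 1, 1, 2, 2, 3, 4. Because it only compares against the first member of each group, the result can depend on row order, so the threshold would need checking against real data.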

Pandas replace strings with fuzzy match in the same column

吃可爱长大的小学妹 submitted on 2021-01-29 13:27:28
Question: I have a column in a dataframe that looks like this:

OWNER
--------------
OTTO J MAYER
OTTO MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLI
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT CITY

Some of the names in my data frame are misspelled or have extra/missing characters. I need a column where the names are replaced by any close match in the same column. However, all the similar names need to be grouped under one and the same name. For example, this is what I expect from the data frame above:
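A rough sketch of one way to map each name onto a single canonical spelling, using only difflib from the standard library so it runs without pandas. The helper name canonicalize, the "first spelling seen wins" rule, and the 0.8 cutoff are all assumptions made for illustration; the expected output in the question is cut off, so this does not claim to reproduce it.

```python
from difflib import get_close_matches


def canonicalize(names, cutoff=0.8):
    """Replace each name with the first-seen spelling it closely matches.

    Names with no close match among the canonical spellings collected
    so far become canonical spellings themselves.
    """
    canonical = []  # spellings chosen so far, in first-seen order
    result = []
    for name in names:
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            result.append(match[0])
        else:
            canonical.append(name)
            result.append(name)
    return result
```

On a dataframe, the returned list could be assigned back as a new column, for example `df["OWNER_CLEAN"] = canonicalize(df["OWNER"].tolist())`, where OWNER_CLEAN is a hypothetical column name. Which pairs get merged depends heavily on the cutoff, so borderline pairs should be checked by hand.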

SQL Query Find Exact and Near Dupes

别说谁变了你拦得住时间么 submitted on 2021-01-29 08:26:30
Question: I have a SQL table with FirstName, LastName, Add1 and other fields, and I am working to get this data cleaned up. There are a few kinds of likely dupes:

- All 3 columns are exactly the same for more than 1 record
- The First and Last are the same, but only 1 record has an address and the other is blank
- The First and Last are similar (John | Doe vs John C. | Doe) and the address is the same or one is blank

I want to generate a query I can provide to the users, so they can check these records out, compare
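The three kinds of likely dupes listed above can be sketched as a pairwise classifier. This is written in Python rather than SQL and is only an illustration: the function names, the labels, the sample address values, and the 0.7 similarity threshold for "similar" first names are hypothetical, since the question does not define how similar two names must be.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.7):
    """Assumed definition of 'similar': difflib ratio at or above threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def classify_pair(rec_a, rec_b):
    """Classify two (FirstName, LastName, Add1) tuples, or return None."""
    first_a, last_a, addr_a = (field.strip() for field in rec_a)
    first_b, last_b, addr_b = (field.strip() for field in rec_b)

    same_last = last_a.lower() == last_b.lower()
    same_name = same_last and first_a.lower() == first_b.lower()
    same_addr = addr_a.lower() == addr_b.lower()
    one_blank = (addr_a == "") != (addr_b == "")  # exactly one is blank

    if same_name and same_addr:
        return "exact"
    if same_name and one_blank:
        return "same name, one address blank"
    if same_last and similar(first_a, first_b) and (same_addr or one_blank):
        return "similar name"
    return None
```

For example, `classify_pair(("John", "Doe", "1 Main St"), ("John C.", "Doe", "1 Main St"))` returns "similar name" under these assumptions. Comparing every pair is quadratic in the number of rows, so on a real table it would make sense to compare only records that share a last name.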

Fuzzy merging in R - seeking help to improve my code

一个人想着一个人 submitted on 2021-01-20 19:53:24
Question: Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance) matching. The merging job I have to do is quite big (resulting in multiple string distance matrices with a little less than one billion cells), and I had the impression that fuzzy_join is not written very efficiently (with regard to memory usage) and that the parallelization is implemented in a weird manner (the computation of the
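As a general illustration of combining exact and fuzzy matching (not the author's R function, which is not shown here), one common pattern is to resolve exact matches first so that string similarities are only computed for the leftovers. The sketch below is in Python; the name exact_then_fuzzy and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def exact_then_fuzzy(left, right, threshold=0.85):
    """Map each key in `left` to a key in `right`, or to None.

    Exact matches are resolved with a set lookup. Only keys without an
    exact match are scored against the remaining candidates, which keeps
    the number of pairwise comparisons down.
    """
    right_set = set(right)
    matches = {}
    unmatched = []
    for key in left:
        if key in right_set:
            matches[key] = key
        else:
            unmatched.append(key)

    exact_used = set(matches.values())
    candidates = [r for r in right if r not in exact_used]

    for key in unmatched:
        best, best_score = None, threshold
        for cand in candidates:
            score = SequenceMatcher(None, key, cand).ratio()
            if score >= best_score:
                best, best_score = cand, score
        matches[key] = best
    return matches
```

Scores are computed one pair at a time and discarded, so no full distance matrix is ever held in memory; the trade-off is that a pure-Python loop like this would be slow at the scale described in the question.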

Better fuzzy matching performance?

风流意气都作罢 submitted on 2020-07-05 04:39:06
Question: I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings and get the closest match against another list of approximately 15,000 strings:

    a=['blah','pie','apple'...]
    b=['jimbo','zomg','pie'...]
    for value in a:
        difflib.get_close_matches(value,b,n=1,cutoff=.85)

It takes .58 seconds per value, which means it will take 8,714 seconds or 145 minutes to finish the loop. Is there another library/method that might be faster or a way to improve the speed for
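One way to cut down the work while staying with difflib is to skip candidates that cannot possibly reach the cutoff. The sketch below assumes two shortcuts: an exact-match lookup in a set, and a length filter based on the fact that the ratio 2*M/(len(a)+len(b)) can never exceed 2*min(len(a), len(b))/(len(a)+len(b)). The helper names build_length_index and closest_match are hypothetical, and the actual speed-up depends on how varied the string lengths are.

```python
import difflib
from collections import defaultdict


def build_length_index(candidates):
    """Group candidate strings by length so infeasible lengths can be skipped."""
    index = defaultdict(list)
    for cand in candidates:
        index[len(cand)].append(cand)
    return index


def closest_match(value, candidate_set, length_index, cutoff=0.85):
    """Return the closest candidate for `value`, or None if none reach cutoff."""
    # Shortcut 1: an exact match is always the best possible match.
    if value in candidate_set:
        return value

    # Shortcut 2: only keep lengths whose best-case ratio can reach cutoff.
    n = len(value)
    pool = [
        cand
        for m, bucket in length_index.items()
        if 2 * min(n, m) >= cutoff * (n + m)
        for cand in bucket
    ]
    found = difflib.get_close_matches(value, pool, n=1, cutoff=cutoff)
    return found[0] if found else None
```

Usage would be to build `candidate_set = set(b)` and `length_index = build_length_index(b)` once, then call `closest_match(value, candidate_set, length_index)` inside the loop over `a`.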

How do I fuzzy match just adjacent cells?

假装没事ソ submitted on 2020-06-11 10:01:11
Question: I have 10,000 names in each of two corresponding columns. Each cell in Column A corresponds to the adjacent cell in Column B. I want to do a fuzzy match and get a compatibility score for each pair, comparing each cell only with its adjacent cell. I do not want to search entire column versus entire column, just adjacent cells, which I don't seem to be able to do with the Fuzzy Match Excel add-in. Any ideas? Example:

Column A   Column B   Value
Apple      Aplle      80%
Banana     Banana     100%
Orange     Ornge      85%

Answer 1
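A minimal sketch of adjacent-cell scoring done outside Excel, assuming the two columns have been exported as a list of (Column A, Column B) pairs. It uses difflib's SequenceMatcher ratio as the score, which is an assumption: the percentages in the question's example come from an unspecified method, so the numbers will not necessarily agree. The function name pairwise_scores is hypothetical.

```python
from difflib import SequenceMatcher


def pairwise_scores(pairs):
    """Score each (a, b) pair on its own, as a whole-number percentage.

    Every row is compared only with its adjacent cell, never with the
    rest of the column.
    """
    scores = []
    for a, b in pairs:
        ratio = SequenceMatcher(None, a, b).ratio()
        scores.append(round(ratio * 100))
    return scores
```

Since each row needs exactly one comparison, 10,000 rows means 10,000 comparisons rather than the 100 million a column-versus-column search would need.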