fuzzy-comparison

Quicker way to perform fuzzy string match in pandas

℡╲_俬逩灬. posted on 2019-12-18 05:27:09
Question: Is there any way to speed up fuzzy string matching with fuzzywuzzy in pandas? I have a dataframe, extra_names, containing names that I want to fuzzy-match against another dataframe, names_df.
>> extra_names.head()
       not_matching
0         Vij Sales
1  Crom Electronics
2       REL Digital
3        Bajaj Elec
4     Reliance Digi
>> len(extra_names)
6500
>> names_df.head()
               names  types
0        Vijay Sales      1
1  Croma Electronics      1
2   Reliance Digital      2
3  Bajaj Electronics      2
4    Pai Electricals      2
>> len(names_df)
250
As of
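
The matching task in this question can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard-library difflib as a stand-in for fuzzywuzzy, the helper name best_match is hypothetical, and the two name lists are the head() rows quoted in the question. Caching results with lru_cache is one possible way to avoid rescoring repeated inputs; it is not taken from the original thread.

```python
from difflib import get_close_matches
from functools import lru_cache

# Reference names, copied from names_df.head() in the question.
NAMES = [
    "Vijay Sales",
    "Croma Electronics",
    "Reliance Digital",
    "Bajaj Electronics",
    "Pai Electricals",
]

# Lower-cased lookup so that matching ignores case (an assumption of this sketch).
_LOOKUP = {name.lower(): name for name in NAMES}


@lru_cache(maxsize=None)
def best_match(query, cutoff=0.6):
    """Return the closest reference name, or None if nothing reaches the cutoff."""
    hits = get_close_matches(query.lower(), list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[hits[0]] if hits else None


if __name__ == "__main__":
    for raw in ["Vij Sales", "Crom Electronics", "REL Digital", "Bajaj Elec", "Reliance Digi"]:
        print(raw, "->", best_match(raw))
```

With 6500 inputs against 250 reference names, the cache only helps when inputs repeat; the scoring of each distinct input is still a full scan of the reference list.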

Fuzzy Regular Expressions

狂风中的少年 posted on 2019-12-17 21:48:29
Question: In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance, with great results, to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...) . This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2. This could be done by generating all strings in the regex (in this case 100x12) and finding the best
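
The brute-force idea described in this question (enumerate every string the pattern can produce, then take the smallest edit distance) can be sketched as follows. Assumptions: the elided month list is taken to be the twelve standard English three-letter abbreviations, plain Levenshtein distance is used in place of Damerau–Levenshtein, and the function names are hypothetical.

```python
from itertools import product

# Assumption: the "(Jan|Feb|Mar|...)" group stands for these twelve abbreviations.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def candidates():
    """Yield all 100 x 12 strings matched by the example pattern."""
    digits = "0123456789"
    for d1, d2, month in product(digits, digits, MONTHS):
        yield f"TV Schedule for {d1}{d2} {month}"


def fuzzy_regex_distance(text):
    """Smallest edit distance between text and any string the pattern accepts."""
    return min(levenshtein(text, c) for c in candidates())


if __name__ == "__main__":
    print(fuzzy_regex_distance("TV Schedule for 10 Jan"))   # exact match
    print(fuzzy_regex_distance("T Schedule for 10. Jan"))   # one missing, one extra character
```

This only works because the pattern is finite and small; any pattern with an unbounded repetition cannot be enumerated this way.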

Fuzzy String Comparison

╄→尐↘猪︶ㄣ posted on 2019-12-17 05:36:26
Question: What I am trying to build is a program that reads in a file and compares each sentence against the original sentence. A sentence that perfectly matches the original receives a score of 1, and a sentence that is the total opposite receives a 0. All other fuzzy sentences receive a grade between 0 and 1. I am unsure which operation to use to do this in Python 3. I have included the sample text in which the Text 1 is the original and the
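
One way to get a score between 0 and 1 in Python 3 is the standard library's difflib.SequenceMatcher. The sketch below is only one possible choice, not necessarily what the original thread recommended; the function name similarity is hypothetical. Note that the ratio measures shared characters, so a 0 means "nothing in common", not "opposite in meaning".

```python
from difflib import SequenceMatcher


def similarity(original, candidate):
    """Return a score in [0, 1]: 1.0 for identical strings, 0.0 for no overlap."""
    return SequenceMatcher(None, original, candidate).ratio()


if __name__ == "__main__":
    original = "The quick brown fox"
    for sentence in ["The quick brown fox", "The quick brown fax", "zzzz"]:
        print(f"{sentence!r}: {similarity(original, sentence):.3f}")
```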

Match two datasets across multiple ‘dirty’ columns in R

淺唱寂寞╮ posted on 2019-12-13 03:33:55
Question: I frequently need to match two datasets on multiple columns, for two reasons. First, each of these characteristics is ‘dirty’, meaning a single column does not consistently match even when it should (for a truly matching row). Second, the characteristics are not unique (e.g., male and female). Matching like this is useful for matching across time (pre-test with post-test scores), different data modalities (observed characteristics and lab values), or multiple datasets for research

Using Jaro-Winkler, is distance between A and B the same as B and A?

北城以北 posted on 2019-12-13 03:19:22
Question: I'm using the following class to calculate the Jaro-Winkler distance between two strings. What I'm noticing is that the distance calculated between string A and B is not always the same as between string B and A. Is this to be expected? RAMADI ~ TRADING 0.73492063492063 TRADING ~ RAMADI 0.71825396825397 Answer 1: It turns out there is a bug in the PHP versions of the Jaro-Winkler string comparison method found in many places online. Currently, string A compared to string B will yield a different result to
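
For comparison, here is a textbook-style Jaro-Winkler sketch in Python. It is not the PHP class from the question and is not claimed to reproduce its scores; the function names, the prefix scale p = 0.1, and the four-character prefix limit are the usual conventions, stated here as assumptions. For the pair quoted in the question it gives the same score in both directions.

```python
def jaro(s1, s2):
    """Jaro similarity with the usual match window of max(len) // 2 - 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in their original order; mismatched pairs are half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro score boosted by the length of the common prefix (at most 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


if __name__ == "__main__":
    print(jaro_winkler("RAMADI", "TRADING"))
    print(jaro_winkler("TRADING", "RAMADI"))
```

Every term in the formula treats the two strings the same way (the window depends on the longer length, and the two length ratios are simply added), so an implementation that scores a pair differently depending on argument order is worth checking for a bug in its window or matching loop.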

difflib on Ruby [closed]

倾然丶 夕夏残阳落幕 posted on 2019-12-10 14:42:41
Question: Is there a library similar to Python's difflib for Ruby? In particular, I need one that has a method similar to difflib.get_close_matches. Any recommendations? Answer 1: After some research, I suggest using amatch or SimMetrics (with JRuby) and manually implementing the get_close_matches method. Both libs offer
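
As a reference for anyone porting it by hand, the behaviour of get_close_matches can be approximated in a few lines of Python: score every candidate, drop those below a cutoff, and return the best n. The function name close_matches is hypothetical, and the sketch ignores the shortcuts and tie-breaking of the real difflib implementation.

```python
from difflib import SequenceMatcher


def close_matches(word, possibilities, n=3, cutoff=0.6):
    """Return up to n candidates whose similarity to word is at least cutoff, best first."""
    scored = [(SequenceMatcher(None, candidate, word).ratio(), candidate)
              for candidate in possibilities]
    kept = [pair for pair in scored if pair[0] >= cutoff]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in kept[:n]]


if __name__ == "__main__":
    # Same input as the example in the Python difflib documentation.
    print(close_matches("appel", ["ape", "apple", "peach", "puppy"]))
    # → ['apple', 'ape']
```

A Ruby port would need only a similarity function returning a value between 0 and 1 in place of SequenceMatcher.ratio; the filtering and sorting steps carry over directly.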

R - Merging two data files based on partial matching of inconsistent full name formats

筅森魡賤 posted on 2019-12-08 04:28:05
Question: Here is my previous question reposted with R format. I'm looking for a way to merge two data files based on partial matching of participants' full names, which are sometimes entered in different formats and sometimes misspelled. I know there are several functions for partial matching (e.g. agrep and pmatch) and for merging data files, but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name

Fuzzy match row in one column with same row in next column

蹲街弑〆低调 posted on 2019-12-07 21:38:53
Question: I would like to find information in one column based on another column. I have some words in one column and complete sentences in another, and I would like to know whether the words occur in those sentences. But sometimes the words are not spelled the same, so I cannot use the SQL LIKE function. Thus I think fuzzy matching plus some sort of 'like' function would be helpful, as the data looks like this: Names Sentences Airplanes Sarl Airplanes-Sàrl is part of Airplanes-Group Sarl. Kidco Ltd. 100%

Position of Approximate Substring Matches in R

谁说胖子不能爱 posted on 2019-12-07 18:56:46
Question: I'm using R for string processing. I have a data frame with a column of strings, say:
df <- data.frame(textcol=c("In this substring would like to find the position of this substring", "I would also like to find the position of thes substring", "No match here","No mention of this substrangy thing"))
matchPattern <- "this substring"
I am searching for a function that (depending on a distance parameter of some sort, say Jaro-Winkler) would take my matchPattern, compare it to every row of the

How to normalize company names

爷,独闯天下 posted on 2019-12-07 17:16:04
Question: We have user-generated names of employers that come in all variations. For example, people have typed in or imported: Google Google, Inc. Google Inc. Google inc To a database search, these look like different companies altogether. We've changed some things to map each employer to a "normalized" name, but with 70,000 in total, it becomes hard to do it by hand. Does anyone have suggestions on how to normalize the existing entries, and also how to make sure we keep doing it for all incoming names as
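
A first normalization pass for the variants listed in this question can be sketched as follows. The suffix list and the function name normalize_company are assumptions made for illustration; a real dataset of 70,000 names would need a longer list and probably a fuzzy-matching step afterwards for misspellings.

```python
import re

# Assumption: a small, illustrative set of legal suffixes to strip.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited",
                  "corp", "corporation", "co"}


def normalize_company(name):
    """Lower-case, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Keep at least one token so a company literally named "Inc" is not erased.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ["Google", "Google, Inc.", "Google Inc.", "Google inc"]:
        print(repr(raw), "->", repr(normalize_company(raw)))
```

Running the same function on every incoming name before it is stored keeps new entries consistent with the cleaned existing ones.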