fuzzy-search

How can I fuzzy match strings from two datasets?

一个人想着一个人 submitted on 2019-11-26 08:51:47
Question: I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the
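
As a rough illustration of the kind of join described above, here is a minimal Python sketch. It is not the asker's R/agrep approach: it pairs each name in one list with its closest counterpart in another list using the standard library's `difflib.get_close_matches`. The function name `fuzzy_join`, the sample company names, and the 0.6 cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches


def fuzzy_join(left_names, right_names, cutoff=0.6):
    """Map each name in left_names to its closest name in right_names, or None."""
    # Compare case-insensitively, but return the original spelling.
    lookup = {name.lower(): name for name in right_names}
    result = {}
    for name in left_names:
        hits = get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
        result[name] = lookup[hits[0]] if hits else None
    return result


# Hypothetical data: one list with typos, neither list has a shared ID.
financials = ["Acme Corporation", "Globex Inc", "Initech"]
addresses = ["Acme Corporatoin", "Globex Inc.", "Umbrella Co"]

for left, right in fuzzy_join(financials, addresses).items():
    print(left, "->", right)
```

Names with no candidate above the cutoff map to `None`, so unmatched rows stay visible instead of being paired with a poor guess. The cutoff would need tuning against real data.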

Javascript fuzzy search that makes sense

寵の児 submitted on 2019-11-26 06:56:49
Question: I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages). After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match. One obvious flaw, which is
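
For readers who want to see the calculation the question refers to, the following is a minimal Python sketch of Levenshtein distance using the usual row-by-row dynamic programming table. The function name `levenshtein` is chosen for the example, and the sketch is not taken from fuzzyset.js or fuse.js.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Because every edit costs the same, a short query typed as a prefix of a long target still scores as many insertions away, which is one reason the raw distance can rank results differently from what a user typing a search would expect.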

Fuzzy string search library in Java [closed]

巧了我就是萌 submitted on 2019-11-26 03:41:15
I'm looking for a high performance Java library for fuzzy string search. There are numerous algorithms to find similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc. What Java implementations exist? What are their pros and cons? I'm aware of Lucene; is there any other solution, or is Lucene best? I found these, does anyone have experience with them? SimMetrics, NGramJ. JodaStephen: Commons Lang has an implementation of Levenshtein distance. Commons Codec has an implementation of Soundex and Metaphone. Henno Vermeulen: You can use Apache Lucene, but depending on the use case this may be

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

纵然是瞬间 submitted on 2019-11-26 03:27:58
Question: My users will import, through cut and paste, a large string that will contain company names. I have an existing and growing MySQL database of company names, each with a unique company_id. I want to be able to parse through the string and assign a fuzzy match to each of the user-entered company names. Right now, just doing a straight-up string match is also slow. Will Soundex indexing be faster? How can I give the user some options as they are typing? For example, someone writes:

A better similarity ranking algorithm for variable length strings

邮差的信 submitted on 2019-11-26 03:24:15
Question: I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (Levenshtein distance, Soundex, etc). For example, given string A: "Robert", then string B: "Amy Robertson" would be a better match than string C: "Richard". Also, preferably, this algorithm should be language agnostic (it should also work in languages other than English). Answer 1: Simon White of Catalysoft wrote an article about a very clever algorithm that

Merging two Data Frames using Fuzzy/Approximate String Matching in R

偶尔善良 submitted on 2019-11-26 01:59:17
Question: DESCRIPTION I have two datasets with information that I need to merge. The only common fields that I have are strings that do not perfectly match and a numerical field that can be substantially different. The only way to explain the problem is to show you the data. Here is a.csv and b.csv. I am trying to merge B to A. There are three fields in B and four in A: Company Name (File A only), Fund Name, Asset Class, and Assets. So far, my focus has been on attempting to match the Fund Names by

Fuzzy string search library in Java [closed]

微笑、不失礼 submitted on 2019-11-26 01:57:49
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed last year. I'm looking for a high performance Java library for fuzzy string search. There are numerous algorithms to find similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc. What Java implementations exist? What are their pros and cons? I'm aware of Lucene; is there any other solution, or is Lucene best? I found

Fuzzy matching using T-SQL

不打扰是莪最后的温柔 submitted on 2019-11-26 00:56:24
I have a table Persons with personal data and so on. There are lots of columns, but the ones of interest here are addressindex, lastname and firstname, where addressindex is a unique address drilled down to the door of the apartment. So if, as below, two persons have the same lastname and one of the firstnames is the same, they are most likely duplicates. I need a way to list these duplicates. tabledata: personid 1, firstname "Carl", lastname "Anderson", addressindex 1; personid 2, firstname "Carl Peter", lastname "Anderson", addressindex 1. I know how to do this if I were to match exactly on all columns
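
The duplicate rule described above can be sketched outside the database. The following Python example is only an illustration, not T-SQL and not an answer from the original thread: it groups rows by `addressindex` and `lastname`, then flags pairs whose first names share at least one word. The "share at least one word" test, the function name `likely_duplicates`, and the third sample row are assumptions added for the example.

```python
from collections import defaultdict
from itertools import combinations


def likely_duplicates(persons):
    """Return (personid, personid) pairs with the same address and last name
    whose first names have at least one word in common."""
    groups = defaultdict(list)
    for person in persons:
        key = (person["addressindex"], person["lastname"].lower())
        groups[key].append(person)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            names_a = set(a["firstname"].lower().split())
            names_b = set(b["firstname"].lower().split())
            # Assumed rule: one shared first name is enough to flag the pair.
            if names_a & names_b:
                pairs.append((a["personid"], b["personid"]))
    return pairs


persons = [
    {"personid": 1, "firstname": "Carl", "lastname": "Anderson", "addressindex": 1},
    {"personid": 2, "firstname": "Carl Peter", "lastname": "Anderson", "addressindex": 1},
    # Hypothetical extra row: same household, different person.
    {"personid": 3, "firstname": "Anna", "lastname": "Anderson", "addressindex": 1},
]

print(likely_duplicates(persons))
```

Grouping first keeps the pairwise comparison limited to people at the same door with the same last name, which is the same idea a self-join on those two columns would express in SQL.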

Efficient string matching in Apache Spark

耗尽温柔 submitted on 2019-11-26 00:19:53
Question: Using an OCR tool I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time. Given the text "Hello there 😊! I really like Spark ❤️!", I noticed that: 1) Letters like "I", "!", and "l" get replaced by "|". 2) Emojis are not correctly extracted and are replaced by other characters or left out. 3) Blank spaces are removed from time to time. As a result, I might end up with a

A better similarity ranking algorithm for variable length strings

泪湿孤枕 submitted on 2019-11-25 23:12:26
I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (Levenshtein distance, Soundex, etc). For example, given string A: "Robert", then string B: "Amy Robertson" would be a better match than string C: "Richard". Also, preferably, this algorithm should be language agnostic (it should also work in languages other than English). Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes: http://www.catalysoft.com/articles
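
The adjacent-character-pair idea mentioned above can be sketched in Python as follows. This is one reading of that description, not a copy of the linked article: splitting on whitespace before pairing, upper-casing, and scoring as twice the shared pairs over the total number of pairs are assumptions, and the names `letter_pairs` and `pair_similarity` are invented for the example.

```python
def letter_pairs(text):
    """List the adjacent character pairs of each whitespace-separated word."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Score in [0, 1]: twice the number of shared pairs over the total pairs."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    shared = 0
    for pair in pairs_a:
        if pair in pairs_b:
            shared += 1
            pairs_b.remove(pair)  # each pair in b can be matched only once
    return 2.0 * shared / total


for candidate in ("Amy Robertson", "Richard"):
    print(candidate, round(pair_similarity("Robert", candidate), 3))
```

With the strings from the question, "Robert" shares all five of its pairs with "Amy Robertson" and none with "Richard", so the longer string ranks higher, which is the ordering the asker wanted.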