fuzzy-search

How to find best fuzzy match for a string in a large string database

筅森魡賤 submitted on 2019-11-27 05:36:57
Question: I have a database of strings (of arbitrary length) which holds more than one million items (potentially more). I need to compare a user-provided string against the whole database and retrieve an identical string if it exists, or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second. My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their
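The approach described in the question (exact lookup first, then edit distance over a reduced candidate set) might be sketched as follows. This is a minimal illustration, not the asker's method: the excerpt is cut off before it names the narrowing criterion, so the length-based prefilter below is an assumption, as are the function names and the definition of similarity as 1 - distance / longer length.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], defined here (an assumption) as 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_matches(query, strings, threshold=0.6):
    """Return the exact match if present, else all candidates at or above threshold, best first."""
    if query in strings:  # O(1) when `strings` is a set
        return [(query, 1.0)]
    scored = []
    for s in strings:
        # Assumed prefilter: the distance is at least the length difference,
        # so strings of very different length cannot reach the threshold.
        longest = max(len(s), len(query))
        if longest and 1.0 - abs(len(s) - len(query)) / longest < threshold:
            continue
        score = similarity(query, s)
        if score >= threshold:
            scored.append((s, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    db = {"fuzzy search", "fuzzy match", "exact"}
    print(best_matches("fuzzy serch", db))
```

The length prefilter only skips strings that could not pass the threshold anyway; on a million-row table a real index (for example on n-grams) would likely be needed to approach the one-second target.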

Fuzzy string matching in Python

北慕城南 submitted on 2019-11-27 05:30:42
Question: I have 2 lists of over a million names with slightly different naming conventions. The goal here is to match those records that are similar, with the logic of 95% confidence. I am aware there are libraries which I can leverage, such as the FuzzyWuzzy module in Python. However, in terms of processing, it seems it will take up too many resources to compare every string in one list to the other, which in this case seems to require 1 million multiplied by another million number of
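One common way to avoid the full million-by-million comparison is blocking: only names that share a cheap key are compared with each other. The sketch below is illustrative only. It uses the standard-library `difflib.SequenceMatcher` instead of FuzzyWuzzy, treats the 95% figure as a similarity threshold, and uses a hypothetical blocking key (the first character of the lower-cased name), which will miss pairs whose first characters differ.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name: str) -> str:
    """Hypothetical blocking key: first character of the stripped, lower-cased name."""
    return name.strip().lower()[:1]


def match_with_blocking(left, right, threshold=0.95):
    """Pair each name in `left` with its best match in `right` from the same block."""
    blocks = defaultdict(list)
    for name in right:
        blocks[block_key(name)].append(name)

    matches = {}
    for name in left:
        best_name, best_score = None, 0.0
        # Only candidates in the same block are scored, not the whole list.
        for candidate in blocks.get(block_key(name), []):
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_name, best_score = candidate, score
        if best_score >= threshold:
            matches[name] = (best_name, best_score)
    return matches


if __name__ == "__main__":
    list_a = ["Jonathan Smithson Jr", "Zed"]
    list_b = ["Jonathan Smithson Jr.", "Mary Jones"]
    print(match_with_blocking(list_a, list_b))
```

The number of comparisons drops from len(left) * len(right) to the sum of the block sizes visited, at the cost of missing matches that fall into different blocks; the choice of key controls that trade-off.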

Apply fuzzy matching across a dataframe column and save results in a new column

 ̄綄美尐妖づ submitted on 2019-11-27 01:30:20
I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company                                City       State  ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE    St. Louis  MO     63101
CITYARCHRIVER 2015 FOUNDATION          St. Louis  MO     63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE    St. Louis  MO     63102
LACKEY SHEET METAL                     St. Louis  MO     63102
and df2 =
FDA Company                    FDA City    FDA State  FDA ZIP
LACKEY SHEET METAL             St. Louis   MO         63102
PRIMUS STERILIZER COMPANY LLC  Great Bend  KS         67530
HELGET GAS PRODUCTS INC        Omaha       NE         68127
ORTHOQUEST LLC                 La Vista    NE         68128
I joined them side by side using combined_data = pandas

Best Fuzzy Matching Algorithm? [closed]

柔情痞子 submitted on 2019-11-27 00:28:13
Question: What is the best fuzzy matching algorithm (fuzzy logic, n-gram, Levenshtein, Soundex, ...) to process more than 100,000 records in less time?
Answer 1: I suggest you read the articles by Navarro mentioned in the References section of the Wikipedia article titled Approximate string matching. Making your decision based on actual research is always better than relying on suggestions from random strangers, especially if performance on a known set of records is important to you.
Answer 2: It massively depends on your

How to create simple fuzzy search with Postgresql only?

我与影子孤独终老i submitted on 2019-11-26 23:57:54
Question: I have a little problem with the search functionality on my RoR-based site. I have many Products with some CODEs. A code can be any string, like "AB-123-lHdfj". Currently I use the ILIKE operator to find products: Product.where("code ILIKE ?", "%" + params[:search] + "%") It works fine, but it can't find a product when the search term is something like "AB123-lHdfj" or "AB123lHdfj". What should I do about this? Maybe PostgreSQL has some string normalization function, or some other method that could help me? :)
Answer 1: Postgres provides a
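The normalization idea raised in the question can be pictured with a short sketch, written here in Python rather than SQL or Ruby. It is not the answer's (truncated) Postgres solution; the function names are hypothetical, and the rule of dropping every character that is not a letter or digit is an assumption about what "normalization" should mean for these codes.

```python
import re


def normalize_code(code: str) -> str:
    """Drop everything except ASCII letters and digits, then lower-case."""
    return re.sub(r"[^0-9A-Za-z]", "", code).lower()


def search_codes(codes, term):
    """Case-insensitive substring search on normalized codes (mimics ILIKE '%term%')."""
    needle = normalize_code(term)
    return [code for code in codes if needle in normalize_code(code)]


if __name__ == "__main__":
    codes = ["AB-123-lHdfj", "XY-999-qqq"]
    print(search_codes(codes, "AB123-lHdfj"))  # ['AB-123-lHdfj']
    print(search_codes(codes, "AB123lHdfj"))   # ['AB-123-lHdfj']
```

Because both the stored code and the search term go through the same normalization, differences in hyphenation or letter case no longer prevent a match. In a database, the normalized value would typically be stored or indexed so it is not recomputed per row.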

Lucene query: bla~* (match words that start with something fuzzy), how?

不打扰是莪最后的温柔 submitted on 2019-11-26 22:58:55
Question: In the Lucene query syntax I'd like to combine * and ~ in a valid query similar to: bla~* //invalid query Meaning: Please match words that begin with "bla" or something similar to "bla". Update: What I do now, which works for small input, is use the following (snippet of Solr schema): <fieldtype name="text_ngrams" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"

Javascript fuzzy search that makes sense

江枫思渺然 submitted on 2019-11-26 21:24:19
I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages). After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match. One obvious flaw, which is fixed in the Damerau-Levenshtein model, is that both blub and boob are considered equally similar to bulb
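The blub/boob/bulb observation can be checked with a small sketch. The function below computes plain Levenshtein distance and, when `transpositions=True`, the restricted Damerau-Levenshtein variant (optimal string alignment), which additionally counts a swap of two adjacent characters as a single edit. The function name and flag are illustrative and not taken from fuzzyset.js or fuse.js.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance; with transpositions=True, optimal string alignment distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Adjacent characters swapped: count as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(edit_distance("blub", "bulb"))                       # 2
    print(edit_distance("boob", "bulb"))                       # 2
    print(edit_distance("blub", "bulb", transpositions=True))  # 1
    print(edit_distance("boob", "bulb", transpositions=True))  # 2
```

With plain Levenshtein both candidates are two edits away from "bulb"; once transpositions are allowed, "blub" drops to one edit while "boob" stays at two.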

How can I match fuzzy match strings from two datasets?

时光怂恿深爱的人放手 submitted on 2019-11-26 17:32:25
I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the

Merging two Data Frames using Fuzzy/Approximate String Matching in R

心不动则不痛 submitted on 2019-11-26 11:45:55
DESCRIPTION: I have two datasets with information that I need to merge. The only common fields that I have are strings that do not perfectly match and a numerical field that can be substantially different. The only way to explain the problem is to show you the data. Here are a.csv and b.csv. I am trying to merge B to A. There are three fields in B and four in A: Company Name (File A only), Fund Name, Asset Class, and Assets. So far, my focus has been on attempting to match the Fund Names by replacing words or parts of the strings to create exact matches and then using: a <- read.table(file =

Apply fuzzy matching across a dataframe column and save results in a new column

只谈情不闲聊 submitted on 2019-11-26 09:40:12
Question: I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company                                City       State  ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE    St. Louis  MO     63101
CITYARCHRIVER 2015 FOUNDATION          St. Louis  MO     63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE    St. Louis  MO     63102
LACKEY SHEET METAL                     St. Louis  MO     63102
and df2 =
FDA Company                    FDA City    FDA State  FDA ZIP
LACKEY SHEET METAL             St. Louis   MO         63102
PRIMUS STERILIZER COMPANY LLC  Great Bend  KS         67530
HELGET GAS PRODUCTS INC