fuzzy-search

Fuzzy string matching in Python

跟風遠走 submitted on 2019-11-28 17:59:31
I have 2 lists of over a million names each, with slightly different naming conventions. The goal is to match the records that are similar, with 95% confidence. I am aware there are libraries I can leverage, such as the FuzzyWuzzy module in Python. However, in terms of processing it seems it would take too many resources to compare every string in one list against every string in the other, which in this case means roughly a million times a million iterations. Are there any more efficient methods for this problem? UPDATE: So I created a
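Since brute force here means a million times a million comparisons, one common workaround is blocking: group the second list by a cheap key and only run FuzzyWuzzy against names that share that key. The sketch below is only an illustration of that idea; the blocking key (first three letters) and the 95 cutoff are arbitrary choices, not something stated in the question.

```python
# Hedged sketch: block names by a cheap key so each name is fuzzily compared
# only against its own block instead of the full million-row list.
from collections import defaultdict
from fuzzywuzzy import fuzz, process  # pip install fuzzywuzzy

def block_key(name):
    # crude blocking key; a real pipeline might use a phonetic code instead
    return name.strip().lower()[:3]

def match_lists(list_a, list_b, threshold=95):
    blocks = defaultdict(list)
    for name in list_b:
        blocks[block_key(name)].append(name)

    matches = {}
    for name in list_a:
        candidates = blocks.get(block_key(name), [])
        if not candidates:
            continue
        best = process.extractOne(name, candidates,
                                  scorer=fuzz.token_sort_ratio,
                                  score_cutoff=threshold)
        if best:                      # best is (matched_name, score) or None
            matches[name] = best[0]
    return matches
```

Blocking trades recall for speed: two names that disagree in their first three letters will never be compared, which may or may not be acceptable for this data.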

Best Fuzzy Matching Algorithm? [closed]

蹲街弑〆低调 submitted on 2019-11-28 17:15:23
What is the best fuzzy matching algorithm (fuzzy logic, N-gram, Levenshtein, Soundex, ...) to process more than 100,000 records in the least time? Tim: I suggest you read the articles by Navarro mentioned in the References section of the Wikipedia article titled Approximate string matching. Making your decision based on actual research is always better than on suggestions by random strangers, especially if performance on a known set of records is important to you. It massively depends on your data. Certain records can be matched better than others. For example, a postcode has a defined format so can be
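For reference, the baseline most of those options get measured against is plain Levenshtein distance. A minimal pure-Python version is shown below purely as a reference implementation; for 100,000+ records a C-backed library (e.g. python-Levenshtein or RapidFuzz) would normally be used instead.

```python
# Reference Levenshtein distance via dynamic programming, O(len(a) * len(b)).
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                       # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))    # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,              # insertion
                               previous[j] + 1,                 # deletion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```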

Searching names with Apache Solr

一笑奈何 submitted on 2019-11-28 16:35:17
I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names. After reading numerous posts and articles, including: How can I use Lucene for personal name (first name, last name) search?, http://dublincore.org/documents/1998/02/03/name-representation/, What's the best way to search a social network by prioritizing a user's relationships first?, http://www.gossamer-threads.com/lists/lucene/java-user/120417, Lucene Index and Query Design Question - Searching People, Lucene Fuzzy

Fuzzy Regular Expressions

旧街凉风 submitted on 2019-11-28 16:30:27
In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance, with great results, to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2. This could be done by generating all strings in the regex (in this case 100x12) and finding the best match, but that doesn't seem practical. Do you have any ideas how to do this effectively? Thomas Ahle
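One option worth noting (not mentioned in the excerpt) is the third-party regex module, which extends Python's re syntax with approximate matching: an {e<=N} quantifier after a group allows up to N errors. A small sketch against the TV Schedule example, assuming that module is available:

```python
# Sketch using the third-party `regex` module (pip install regex); {e<=3}
# permits up to 3 edits, and BESTMATCH asks for the fewest-error match.
import regex

pattern = r"(?:TV Schedule for \d\d (?:Jan|Feb|Mar)){e<=3}"

for text in ["TV Schedule for 10 Jan", "T Schedule for 10. Jan"]:
    m = regex.fullmatch(pattern, text, flags=regex.BESTMATCH)
    if m:
        subs, ins, dels = m.fuzzy_counts
        print(text, "->", subs + ins + dels, "errors")
```

This should report 0 errors for the first string and roughly 2 for the second, matching the distances described above.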

Fuzzy matching of product names

ぃ、小莉子 submitted on 2019-11-28 16:01:16
I need to automatically match product names (cameras, laptops, TVs etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough. The main problem is that even single-letter changes in relevant keywords can make a huge difference
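One way to encode the "single-letter change in a relevant keyword" problem is to score the digit-bearing model tokens separately from the surrounding words and weight them more heavily. The sketch below is only an illustrative heuristic; the 0.4/0.6 weights are arbitrary and not taken from the question.

```python
# Illustrative heuristic: a one-character change in a model number ("A20" vs
# "A30") usually means a different product, so weight model tokens heavily.
import re
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

def model_tokens(title):
    # tokens that contain a digit, e.g. "a20is", "a20"
    return sorted(t for t in re.findall(r"[a-z0-9]+", title.lower())
                  if any(c.isdigit() for c in t))

def product_similarity(a, b):
    words = fuzz.token_set_ratio(a, b)                    # word-level similarity
    ma, mb = " ".join(model_tokens(a)), " ".join(model_tokens(b))
    models = fuzz.ratio(ma, mb) if (ma or mb) else 100    # model-number similarity
    return round(0.4 * words + 0.6 * models)

print(product_similarity("Canon PowerShot a20IS",
                         "NEW powershot A20 IS from Canon"))
```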

Algorithms for “fuzzy matching” strings

最后都变了- submitted on 2019-11-28 15:32:39
By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters in between, preferring the best fit. I've finally understood what you were looking for. The issue is interesting; however, looking at the 2 algorithms you found, it seems that people have widely different opinions about the specifications ;) I think it would be useful to state the problem and the requirements more clearly. Problem: We are
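A minimal version of that behaviour (every query character present, in order, gaps allowed, tighter matches ranked first) can be sketched as below; the span-based scoring is ad hoc, not the scheme any particular editor actually uses.

```python
# TextMate/Ido-style filtering sketch: keep candidates containing the query
# as a subsequence, rank by how tightly the matched characters pack together.
def subsequence_span(query, candidate):
    """Span of the first greedy subsequence match of query inside candidate,
    or None if query is not a subsequence. Greedy, so not always the tightest."""
    q, c = query.lower(), candidate.lower()
    start, i = None, 0
    for pos, ch in enumerate(c):
        if i < len(q) and ch == q[i]:
            if start is None:
                start = pos
            i += 1
            if i == len(q):
                return pos - start + 1
    return None

def fuzzy_filter(query, candidates):
    scored = []
    for cand in candidates:
        span = subsequence_span(query, cand)
        if span is not None:
            scored.append((span, len(cand), cand))  # tighter span first, then shorter name
    return [cand for _, _, cand in sorted(scored)]

print(fuzzy_filter("abc", ["FooAxxBxxC.txt", "a_b_c", "abc.txt", "acb"]))
# -> ['abc.txt', 'a_b_c', 'FooAxxBxxC.txt']
```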

Fuzzy search algorithm (approximate string matching algorithm)

给你一囗甜甜゛ submitted on 2019-11-28 15:31:58
Question: I wish to create a fuzzy search algorithm. However, after hours of research I am really struggling. I want to create an algorithm that performs a fuzzy search on a list of names of schools. This is what I have looked at so far: most of my research keeps pointing to "string metrics" on Google and Stack Overflow, such as: Levenshtein distance, Damerau-Levenshtein distance, Needleman–Wunsch algorithm. However, these just give a score of how similar 2 strings are. The only way I can think of
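For turning a pairwise score into an actual search over the list, the standard library already provides one ready-made option: difflib.get_close_matches scores every candidate with SequenceMatcher and returns the best ones above a cutoff. The school names below are made up for illustration.

```python
# difflib ranks candidates by SequenceMatcher ratio and keeps those >= cutoff.
import difflib

schools = [
    "St. Mary's Primary School",
    "Saint Marys Primary",
    "Springfield Elementary School",
    "Riverside High School",
]

print(difflib.get_close_matches("St Marys Primary School", schools, n=3, cutoff=0.6))
```

This is usually too slow and too crude for large lists, but it shows the shape of the missing piece: a metric, a ranking, and a threshold.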

How to find best fuzzy match for a string in a large string database

混江龙づ霸主 submitted on 2019-11-28 06:26:15
I have a database of strings (arbitrary length) which holds more than one million items (potentially more). I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second. My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length. However, as I will need to perform this operation very often, I'm thinking about building an
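The length-based narrowing described above can be sketched directly: bucket the database by string length, and only score candidates whose length difference still allows the required similarity. The bound and the scorer below (difflib as a stand-in) are simplifications, not a full index design.

```python
# Sketch of length filtering: skip buckets whose length difference already
# rules out >= 60% similarity, then score the survivors.
import difflib
from collections import defaultdict

class LengthIndexedDB:
    def __init__(self, strings):
        self.by_len = defaultdict(list)
        for s in strings:
            self.by_len[len(s)].append(s)

    def search(self, query, min_similarity=0.6):
        # rough bound: very different lengths cannot reach the cutoff
        min_len = int(len(query) * min_similarity)
        max_len = int(len(query) / min_similarity) + 1
        hits = []
        for length in range(min_len, max_len + 1):
            for cand in self.by_len.get(length, []):
                score = difflib.SequenceMatcher(None, query, cand).ratio()
                if score >= min_similarity:
                    hits.append((score, cand))
        return sorted(hits, reverse=True)

db = LengthIndexedDB(["apple pie", "apple pies", "cherry tart", "aple pie"])
print(db.search("apple pie"))
```

For sub-second lookups over a million strings, this filtering would normally be combined with a real index structure (trigram index, BK-tree, or similar) rather than a linear scan of the surviving buckets.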

How to create simple fuzzy search with Postgresql only?

时间秒杀一切 submitted on 2019-11-28 03:20:34
I have a little problem with the search functionality on my RoR-based site. I have many Products with CODEs. This code can be any string like "AB-123-lHdfj". Now I use the ILIKE operator to find products: Product.where("code ILIKE ?", "%" + params[:search] + "%") It works fine, but it can't find products with codes like "AB123-lHdfj" or "AB123lHdfj". What should I do about this? Maybe PostgreSQL has some string normalization function, or some other method to help me? :) Paul Sasik: Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want
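For the "AB123lHdfj vs AB-123-lHdfj" case, Postgres's pg_trgm extension and its similarity() function are the usual fit. Below is a hedged sketch from Python (not Rails) via psycopg2, assuming CREATE EXTENSION pg_trgm; has been run and that the table and column are named products and code as in the question; the connection string and the 0.3 cutoff are placeholders.

```python
# Sketch: trigram similarity lookup against Postgres (requires pg_trgm).
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string
search = "AB123lHdfj"

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT code, similarity(code, %s) AS score
        FROM products
        WHERE similarity(code, %s) > 0.3
        ORDER BY score DESC
        LIMIT 10
        """,
        (search, search),
    )
    for code, score in cur.fetchall():
        print(code, round(score, 2))
```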

Python Pandas fuzzy merge/match with duplicates

不羁的心 submitted on 2019-11-27 22:32:52
Question: I currently have 2 dataframes, 1 for donors and 1 for fundraisers. Ideally, what I'm trying to find is whether any fundraisers also gave donations, and if so, copy some of that information into my fundraiser data set (donor name, email and their first donation). Problems with my data are: 1) I need to match by name and email, but a user might have slightly different names (e.g. Kat and Kathy). 2) Duplicate names for donors and fundraisers. 2a) With donors I can get unique name/email combinations since I
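One common two-pass shape for this kind of problem is to join on a normalized email first, then fall back to a fuzzy name match for whatever is left. The sketch below only illustrates that idea; the column names (name, email, first_donation) and the 0.8 cutoff are made up, and difflib stands in for whichever scorer you prefer.

```python
# Sketch: exact merge on normalised email, then a fuzzy-name fallback.
import difflib
import pandas as pd

donors = pd.DataFrame({
    "name": ["Kathy Smith", "Bob Jones"],
    "email": ["kathy@example.com", "bob@example.com"],
    "first_donation": ["2018-01-05", "2019-03-12"],
})
fundraisers = pd.DataFrame({
    "name": ["Kat Smith", "Alice Wu"],
    "email": ["kat.s@example.com", "alice@example.com"],
})

donors["email_key"] = donors["email"].str.strip().str.lower()
fundraisers["email_key"] = fundraisers["email"].str.strip().str.lower()

# 1) exact merge on the normalised email
merged = fundraisers.merge(
    donors[["email_key", "name", "first_donation"]].rename(columns={"name": "donor_name"}),
    on="email_key", how="left",
)

# 2) fuzzy name fallback for fundraisers whose email matched no donor
for idx in merged.index[merged["donor_name"].isna()]:
    hit = difflib.get_close_matches(merged.at[idx, "name"],
                                    donors["name"].tolist(), n=1, cutoff=0.8)
    if hit:
        donor = donors.loc[donors["name"] == hit[0]].iloc[0]
        merged.at[idx, "donor_name"] = donor["name"]
        merged.at[idx, "first_donation"] = donor["first_donation"]

print(merged[["name", "donor_name", "first_donation"]])
```

The duplicate-name problem (2) is the part this sketch glosses over: when a fuzzy hit is ambiguous, you would keep all candidates and decide using the email or donation history instead of taking the single best match.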