fuzzy-search

Compiling a fuzzy regexp with Python's regex module

Submitted by 时光毁灭记忆、已成空白 on 2019-12-01 06:59:59
Problem: When I found out that the Python regex module allows fuzzy matching, I was delighted, as it seemed like a simple solution to many of my problems. But now I have a problem for which I could not find any answer in the documentation: how can I compile strings into regexps while also using the new fuzziness feature? To illustrate my usual needs, here is a small piece of code:

    import regex
    f = open('liner.fa', 'r')
    nosZ2f = 'TTCCGACTACCAAGGCAAATACTGCTTCTCGAC'
    nosZ2r =
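For what it's worth, the fuzziness goes into the pattern string itself before compiling. A minimal sketch, assuming the third-party regex module is installed; the primer nosZ2f is taken from the question, while the sample read is invented for illustration:

```python
import regex  # third-party module: pip install regex

def fuzzy_compile(primer, max_errors=2):
    """Wrap the primer in a group and allow up to max_errors total
    errors (substitutions, insertions, deletions) via {e<=N}."""
    return regex.compile('(%s){e<=%d}' % (primer, max_errors))

nosZ2f = 'TTCCGACTACCAAGGCAAATACTGCTTCTCGAC'  # primer from the question
pat = fuzzy_compile(nosZ2f, 2)

# Invented read with one substitution inside the primer region.
read = 'AAAATTCCGACTACCAAGGCTAATACTGCTTCTCGACGGG'
match = pat.search(read)
print(match is not None)
```

Use {s<=N}, {i<=N} or {d<=N} instead of {e<=N} to bound substitutions, insertions or deletions separately.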

Checking fuzzy/approximate substring existing in a longer string, in Python?

Submitted by 我与影子孤独终老i on 2019-11-30 10:56:41
Problem: Using algorithms like Levenshtein distance (Levenshtein or difflib), it is easy to find approximate matches, e.g.:

    >>> import difflib
    >>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio()
    0.8571428571428571

Fuzzy matches can be detected by choosing a threshold as needed. Current requirement: find a fuzzy substring, based on a threshold, inside a bigger string, e.g.:

    large_string = "thelargemanhatanproject is a great project in themanhattincity"
    query_string = "manhattan"
    # result = "manhatan",
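One stdlib-only way to meet that requirement is to slide a window of roughly the query's length over the larger string and keep the best SequenceMatcher ratio; a sketch using the question's own example data:

```python
from difflib import SequenceMatcher

def fuzzy_substring(query, large, threshold=0.8):
    """Return (best_window, ratio) over windows of len(query) +/- 1,
    or (None, ratio) when nothing clears the threshold. Cost is
    O(len(large) * len(query)) per window size, fine for short queries."""
    best, best_ratio = None, 0.0
    qlen = len(query)
    for size in (qlen - 1, qlen, qlen + 1):
        for i in range(len(large) - size + 1):
            window = large[i:i + size]
            r = SequenceMatcher(None, query, window).ratio()
            if r > best_ratio:
                best, best_ratio = window, r
    return (best, best_ratio) if best_ratio >= threshold else (None, best_ratio)

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(fuzzy_substring("manhattan", large_string))  # finds "manhatan"
```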

ElasticSearch multi_match query over multiple fields with Fuzziness

Submitted by 时光毁灭记忆、已成空白 on 2019-11-30 02:54:11
Problem: How can I add fuzziness to a multi_match query? If someone searches for 'basball', it should still find 'baseball' articles. Currently my query looks like this:

    POST /newspaper/articles/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": {
              "query": "baseball",
              "type": "phrase",
              "fields": [
                "subject^3",
                "section^2.5",
                "article^2",
                "tags^1.5",
                "notes^1"
              ]
            }
          }
        }
      }
    }

One option I was looking at is to do something like this; I just don't know if it is the best option. It's
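One caveat worth noting: Elasticsearch does not support fuzziness on type "phrase" multi_match queries, so a fuzzy variant usually falls back to the default best_fields type. A hedged sketch that only builds the request body as a Python dict (sending it with an HTTP client or the elasticsearch library is assumed, not shown):

```python
import json

# Sketch of a fuzzy multi_match body; "fuzziness": "AUTO" picks an edit
# distance based on term length. Fuzziness is not supported with
# type "phrase", hence the default best_fields type here.
body = {
    "query": {
        "multi_match": {
            "query": "basball",
            "fuzziness": "AUTO",
            "fields": ["subject^3", "section^2.5", "article^2",
                       "tags^1.5", "notes^1"],
        }
    }
}
print(json.dumps(body, indent=2))
```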

How can I create an index with pymongo [duplicate]

Submitted by 拜拜、爱过 on 2019-11-30 00:37:45
Problem: This question already has answers here: Recommended way/place to create index on MongoDB collection for a web application (3 answers). Closed last year.

I want to enable text search on a specific field in my MongoDB, and I want to implement this search in Python (pymongo). When I follow the instructions given on the internet:

    db.foo.ensure_index(('field_i_want_to_index', 'text'), name="search_index")

I get the following error message:

    Traceback (most recent call last):
      File "CVE_search.py",
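The traceback here typically comes from passing a single (key, type) tuple where pymongo expects a list of pairs; ensure_index is also deprecated in favor of create_index. A sketch (db and foo are the question's placeholder names, and the helper is not invoked below because it needs a live MongoDB):

```python
# pymongo expects a list of (key, direction-or-type) pairs, not a bare tuple.
index_spec = [('field_i_want_to_index', 'text')]

def create_text_index(collection, spec, name="search_index"):
    """Hypothetical helper: create_index is the non-deprecated call;
    'collection' would be e.g. db.foo from a live pymongo client."""
    return collection.create_index(spec, name=name)

print(index_spec)
```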

Fuzzy search algorithm (approximate string matching algorithm)

Submitted by [亡魂溺海] on 2019-11-29 18:59:53
I wish to create a fuzzy search algorithm, but after hours of research I am really struggling. I want an algorithm that performs a fuzzy search on a list of names of schools. Most of my research keeps pointing to "string metrics" on Google and Stack Overflow, such as:

- Levenshtein distance
- Damerau-Levenshtein distance
- Needleman–Wunsch algorithm

However, these just give a score of how similar two strings are. The only way I can think of implementing one as a search algorithm is to perform a linear search, executing the string metric algorithm for
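For a list of school names in the thousands, that linear scan is in fact perfectly workable; ranking by edit distance turns the score-only metric into a search. A sketch with invented school names (for much larger lists, a BK-tree or trigram index avoids scanning every entry):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(query, names, k=3):
    """Linear scan, rank by distance; fine for a few thousand names."""
    return sorted(names, key=lambda n: levenshtein(query.lower(), n.lower()))[:k]

schools = ["Springfield Elementary", "Shelbyville High", "Springfield High"]
print(fuzzy_search("Springfeld High", schools))
```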

Python Pandas fuzzy merge/match with duplicates

Submitted by 血红的双手。 on 2019-11-29 04:53:32
I currently have two dataframes, one for donors and one for fundraisers. Ideally, I'm trying to find out whether any fundraisers also gave donations, and if so, copy some of that information into my fundraiser data set (donor name, email and their first donation). The problems with my data are:

1) I need to match by name and email, but a user might have slightly different names (e.g. Kat and Kathy).
2) There are duplicate names among donors and fundraisers.
2a) With donors I can get unique name/email combinations, since I only care about the first donation date.
2b) With fundraisers, though, I need to keep both rows and not
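A hedged sketch of one way to approach the name problem: pick the closest donor name per fundraiser with stdlib difflib, then do an ordinary left merge so every fundraiser row survives. All data below is invented for illustration:

```python
from difflib import get_close_matches

import pandas as pd

donors = pd.DataFrame({
    "name": ["Kathy Smith", "Bob Jones"],
    "email": ["kat@example.com", "bob@example.com"],
    "first_donation": ["2019-01-05", "2019-02-10"],
})
fundraisers = pd.DataFrame({
    "name": ["Kat Smith", "Alice Wu"],
    "event": ["5k Run", "Gala"],
})

def best_donor(name, donor_names, cutoff=0.6):
    """Closest donor name by difflib ratio, or None below the cutoff."""
    hits = get_close_matches(name, donor_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

fundraisers["donor_name"] = fundraisers["name"].apply(
    lambda n: best_donor(n, donors["name"].tolist()))

# Left merge keeps every fundraiser row, matched or not.
merged = fundraisers.merge(donors, left_on="donor_name", right_on="name",
                           how="left", suffixes=("", "_donor"))
print(merged[["name", "donor_name", "email", "first_donation"]])
```

In real data you would match on name and email together; fuzzywuzzy's process.extractOne is a drop-in upgrade for best_donor.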

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

Submitted by 房东的猫 on 2019-11-29 00:06:50
I'm trying to fuzzy match two csv files, each containing one column of names that are similar but not the same. My code so far is as follows:

    import pandas as pd
    from pandas import DataFrame
    from fuzzywuzzy import process
    import csv

    save_file = open('fuzzy_match_results.csv', 'w')
    writer = csv.writer(save_file, lineterminator='\n')

    def parse_csv(path):
        with open(path, 'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                yield row

    if __name__ == "__main__":
        ## Create lookup dictionary by parsing the products csv
        data = {}
        for row in parse_csv('names_1.csv'):
            data[row[0]] = row
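To keep only the best match per name, the fuzzywuzzy call would be process.extractOne rather than process.extract. The sketch below uses stdlib difflib's get_close_matches with n=1 as a stand-in so it runs without fuzzywuzzy, with invented names and an in-memory CSV instead of files on disk:

```python
import csv
import io
from difflib import get_close_matches

names_1 = ["Apple Inc", "Microsoft Corp", "Alphabet"]
names_2 = ["apple incorporated", "Microsoft Corporation", "Netflix"]

rows = []
for name in names_2:
    # n=1 keeps only the single best match; "" when nothing clears cutoff.
    best = get_close_matches(name, names_1, n=1, cutoff=0.5)
    rows.append([name, best[0] if best else ""])

# Write the (name, best_match) pairs; StringIO stands in for the csv file.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
print(buf.getvalue())
```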

PHP/MySQL small-scale fuzzy search

Submitted by 旧巷老猫 on 2019-11-28 22:05:52
I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database of about 2,400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields with tolerance for spelling/character errors; i.e., an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man St", "132 Main St", etc., and likewise for name and date. The main issues I have with answers to other
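At 2,400 records, it is feasible to pull candidate rows out of MySQL and score them in application code: normalize the common abbreviations first, then apply edit distance. A sketch of that idea (in Python, the language of this document's other examples; the abbreviation table and addresses are invented):

```python
import re

# Hypothetical expansion table; extend with whatever the data needs.
ABBREV = {"st": "street", "rd": "road", "ave": "avenue"}

def normalize(addr):
    """Lowercase, drop punctuation, expand abbreviations."""
    words = re.sub(r"[^\w\s]", "", addr.lower()).split()
    return " ".join(ABBREV.get(w, w) for w in words)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

records = ["123 Main St", "456 Oak Ave", "132 Main Street"]
query = normalize("123 Mian Street")
hits = [r for r in records if levenshtein(normalize(r), query) <= 2]
print(hits)
```

MySQL's SOUNDS LIKE covers some of this for last names, but not digit transpositions in addresses, which is why the scoring is done application-side here.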

Lightweight fuzzy search library

Submitted by 依然范特西╮ on 2019-11-28 20:43:46
Can you suggest a lightweight fuzzy text search library? What I want is to let users find the correct data for search terms that contain typos. I could use a full-text search engine like Lucene, but I think that is overkill.

Edit: To make the question clearer, here is the main scenario for that library: I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should be possible to filter the list by a string which is not present in it, yet close enough to some string which is in the list. Example:

    Red
    Green
    Blue

When I type 'Gren' or 'Geen
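For exactly this 'Gren' → 'Green' scenario, Python's standard library already ships something lightweight: difflib.get_close_matches. A minimal sketch with the question's own list:

```python
from difflib import get_close_matches

choices = ["Red", "Green", "Blue"]

# cutoff (default 0.6) filters out strings that are not close enough.
print(get_close_matches("Gren", choices))
print(get_close_matches("Geen", choices))
```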

Similarity function in Postgres with pg_trgm

Submitted by 匆匆过客 on 2019-11-28 20:36:43
I'm trying to use the similarity function in Postgres to do some fuzzy text matching, but whenever I try to use it I get the error:

    function similarity(character varying, unknown) does not exist

If I add explicit casts to text, I get the error:

    function similarity(text, text) does not exist

My query is:

    SELECT (similarity("table"."field"::text, %s::text)) AS "similarity", "table".*
    FROM "table"
    WHERE similarity > .5
    ORDER BY "similarity" DESC
    LIMIT 10

Do I need to do something to initialize pg_trgm?

Answer: You have to install pg_trgm. On Debian, source this SQL: /usr/share/postgresql/8.4/contrib/pg
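On modern Postgres the installation step is a single CREATE EXTENSION statement (the contrib .sql file in the excerpt is the old 8.x route). The other catch in the query above is that SQL cannot reference a SELECT alias in WHERE, so the similarity() call must be repeated there. A sketch that only assembles the SQL strings; running them needs a live connection, e.g. via psycopg2:

```python
# One-time setup per database (owner/superuser privileges assumed).
setup_sql = "CREATE EXTENSION IF NOT EXISTS pg_trgm;"

# WHERE cannot see the SELECT alias, so similarity() is repeated; the
# alias is still fine in ORDER BY. %s placeholders are for the driver.
query_sql = """
SELECT similarity("table"."field", %s) AS sim, "table".*
FROM "table"
WHERE similarity("table"."field", %s) > 0.5
ORDER BY sim DESC
LIMIT 10;
"""
print(setup_sql)
```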