fuzzy-search

Lucene query: bla~* (match words that start with something fuzzy), how?

南楼画角 posted on 2019-11-27 20:49:26
In the Lucene query syntax, I'd like to combine * and ~ in a valid query, similar to: bla~* //invalid query Meaning: please match words that begin with "bla" or something similar to "bla". Update: what I do now, which works for small input, is use the following (snippet of Solr schema): <fieldtype name="text_ngrams" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/> <filter class=

Lightweight fuzzy search library

我的未来我决定 posted on 2019-11-27 20:49:19
Question: Can you suggest a lightweight fuzzy text search library? What I want to do is allow users to find the correct data for search terms that contain typos. I could use a full-text search engine like Lucene, but I think that's overkill. Edit: To make the question clearer, here is the main scenario for that library: I have a large list of strings. I want to be able to search in this list (something like MSVS' IntelliSense), but it should be possible to filter this list by a string which is not present in it
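As a rough illustration of the scenario described above (filtering a list of strings by a term that contains a typo), here is a minimal sketch using `difflib.get_close_matches` from Python's standard library. The `suggest` helper, the sample list, and the `cutoff` value are assumptions made for this example, not something taken from the question.

```python
from difflib import get_close_matches


def suggest(term, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` candidates that resemble `term`, best match first."""
    # Compare case-insensitively, but return the original spellings.
    by_lower = {}
    for candidate in candidates:
        by_lower.setdefault(candidate.lower(), candidate)
    hits = get_close_matches(term.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


# Hypothetical data standing in for the "large list of strings".
words = ["ToString", "ToUpper", "Substring", "StartsWith"]
print(suggest("tostirng", words))
```

`get_close_matches` compares the term against every candidate, so this is only practical for lists of modest size; a very large list would likely need some kind of index.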

PHP/MySQL small-scale fuzzy search

风流意气都作罢 posted on 2019-11-27 14:13:33
Question: I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields and have tolerance for spelling/character errors, e.g., an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man

Similarity function in Postgres with pg_trgm

為{幸葍}努か posted on 2019-11-27 13:00:19
Question: I'm trying to use the similarity function in Postgres to do some fuzzy text matching; however, whenever I try to use it, I get the error: function similarity(character varying, unknown) does not exist If I add explicit casts to text, I get the error: function similarity(text, text) does not exist My query is: SELECT (similarity("table"."field"::text, %s::text)) AS "similarity", "table".* FROM "table" WHERE similarity > .5 ORDER BY "similarity" DESC LIMIT 10 Do I need to do something to initialize

Levenshtein distance based methods Vs Soundex

北城以北 posted on 2019-11-27 12:53:35
As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex. Soundex is rather primitive - it was originally developed to be hand calculated. It results in a key that can be compared. Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison. Levenshtein distance looks at two values and produces a value based on their similarity. It's looking for missing or substituted letters. Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same
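To make the phonetic-key idea concrete, here is a minimal sketch of the common American Soundex variant. The function name and structure are assumptions for this example; real implementations differ in details such as how they treat `h`, `w` and non-ASCII letters.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return a 4-character Soundex key (one letter plus three digits)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (key + "000")[:4]  # pad with zeros, then truncate to 4 characters


print(soundex("Schmidt"), soundex("Smith"))
```

Both names reduce to the same key, which is why a key comparison finds them while a character-level edit distance sees several differing letters.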

SQL Fuzzy Matching

别等时光非礼了梦想. posted on 2019-11-27 12:30:37
I hope I am not repeating this question. I did some searching here and on Google before posting. I am running an eStore on SQL Server 2008 R2 with Full Text enabled. My requirements: there is a Product table, which has the product name, OEM codes, and the models which this product fits. All are text. I have created a new column called TextSearch. It holds the concatenated values of product name, OEM code and the models which this product fits. These values are comma separated. When a customer enters a keyword, we run a search on the TextSearch column to match products. See the matching logic below. I am using a

Fuzzy text (sentences/titles) matching in C#

守給你的承諾、 posted on 2019-11-27 10:53:37
Question: Hey, I'm using the Levenshtein algorithm to get the distance between a source and a target string. I also have a method which returns a value from 0 to 1: /// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two strings are absolutely similar /// </summary> /// <param name="string1">The string1.</param> /// <param name="string2">The string2.</param> /// <returns></returns>
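The body of the method is cut off in this excerpt, so the following is only a sketch of one common way to turn a Levenshtein distance into a score in the [0, 1] range: divide the distance by the length of the longer string and subtract from 1. That normalization is an assumption, not necessarily what the author's C# code does.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(similarity("kitten", "sitting"))
```

Because the distance can never exceed the length of the longer string, the score stays within [0, 1].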

Searching names with Apache Solr

大兔子大兔子 posted on 2019-11-27 09:47:52
问题 I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names. After reading numerous posts and articles including: How can I use Lucene for personal name (first name, last name) search? http://dublincore.org/documents/1998/02/03/name-representation/ what's the best way to search a social network by prioritizing a users relationships first? http://www.gossamer-threads.com

Fuzzy matching of product names

人盡茶涼 posted on 2019-11-27 09:30:06
Question: I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but not well enough

Algorithms for “fuzzy matching” strings

ε祈祈猫儿з posted on 2019-11-27 09:16:09
Question: By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit. Answer 1: I've finally understood what you were looking for. The issue is interesting; however, looking at the 2 algorithms you found, it seems that people have widely different opinions about the
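The kind of matching the question describes (all query characters present in order, possibly with other characters between them) can be sketched as a subsequence test plus a simple ranking. The scoring rule below, which counts the extra characters inside the matched span, is an assumption chosen for illustration; it is not the algorithm used by TextMate, Ido or Icicles, and the greedy leftmost matching does not always find the tightest span.

```python
def subsequence_score(query, candidate):
    """Return None if `query` is not a subsequence of `candidate`.

    Otherwise return the number of non-query characters inside the
    greedily matched span (lower is better).
    """
    q = query.lower()
    c = candidate.lower()
    pos = -1
    first = None
    for ch in q:
        pos = c.find(ch, pos + 1)  # next occurrence after the previous match
        if pos == -1:
            return None
        if first is None:
            first = pos
    if first is None:  # empty query matches everything
        return 0
    return (pos - first + 1) - len(q)


def fuzzy_filter(query, candidates):
    """Keep the candidates that match, best fit first."""
    scored = []
    for candidate in candidates:
        score = subsequence_score(query, candidate)
        if score is not None:
            # Tie-break on shorter candidates, then alphabetically.
            scored.append((score, len(candidate), candidate))
    return [candidate for _, _, candidate in sorted(scored)]


print(fuzzy_filter("fb", ["foobar", "fb.txt", "bar"]))
```

A fuller ranking might also reward matches at word boundaries or at the start of the string, which is one place where implementations differ.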