fuzzy-search | 易学教程

Lucene query: bla~* (match words that start with something fuzzy), how?

In the Lucene query syntax, I'd like to combine * and ~ in a valid query similar to: bla~* //invalid query Meaning: please match words that begin with "bla" or something similar to "bla". Update: What I do now, which works for small input, is to use the following (snippet of Solr schema): <fieldtype name="text_ngrams" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/> <filter class=

Lightweight fuzzy search library

Question: Can you suggest a lightweight fuzzy text search library? What I want to do is allow users to find the correct data for search terms with typos. I could use a full-text search engine like Lucene, but I think it's overkill. Edit: To make the question clearer, here is the main scenario for that library: I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should be possible to filter this list by a string which is not present in it
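For the scenario described (filtering an in-memory list of strings by a possibly misspelled term), one option that needs no external dependency is Python's standard-library `difflib`. The following is a minimal sketch, not a recommendation from the original thread; the function name `suggest`, the sample word list, and the `cutoff` of 0.6 are illustrative assumptions.

```python
from difflib import get_close_matches


def suggest(query, candidates, limit=5, cutoff=0.6):
    """Return up to `limit` candidates that resemble `query`, best match first.

    `suggest` is a hypothetical helper; `cutoff` is difflib's similarity
    threshold in [0, 1], and 0.6 is simply its default value.
    """
    # Compare case-insensitively, but return the original spellings.
    by_lowercase = {}
    for candidate in candidates:
        by_lowercase.setdefault(candidate.lower(), candidate)
    hits = get_close_matches(query.lower(), list(by_lowercase), n=limit, cutoff=cutoff)
    return [by_lowercase[hit] for hit in hits]


# Illustrative data: the query contains a typo and is not in the list.
words = ["Levenshtein", "Soundex", "Metaphone", "Trigram"]
print(suggest("levenstein", words))
```

`get_close_matches` compares the query against every candidate, so this approach suits lists of modest size; for very large lists an index (for example, one built on n-grams) would likely be needed.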

PHP/MySQL small-scale fuzzy search

Question: I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields and have tolerance for spelling/character errors. For example, an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man

Similarity function in Postgres with pg_trgm

Question: I'm trying to use the similarity function in Postgres to do some fuzzy text matching; however, whenever I try to use it I get the error: function similarity(character varying, unknown) does not exist If I add explicit casts to text I get the error: function similarity(text, text) does not exist My query is: SELECT (similarity("table"."field"::text, %s::text)) AS "similarity", "table".* FROM "table" WHERE similarity > .5 ORDER BY "similarity" DESC LIMIT 10 Do I need to do something to initialize
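For context, `similarity()` is provided by the pg_trgm extension, so a "does not exist" error usually suggests the extension has not been installed in the current database. To illustrate the kind of score such a function produces, here is a rough Python sketch of trigram similarity. It assumes the padding scheme described in the pg_trgm documentation (two spaces before each word, one after) and is an approximation for illustration only, not the extension's actual implementation; the function names are hypothetical.

```python
import re


def trigrams(text):
    """Set of 3-character windows over each lowercased word.

    Assumption: each word is padded with two leading spaces and one
    trailing space, following the pg_trgm documentation's description.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(trigram_similarity("word", "two words"), 3))  # → 0.364
```

In this example the two strings share 4 trigrams out of 11 distinct ones, hence 4/11.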

Levenshtein distance based methods Vs Soundex

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex. Soundex is rather primitive - it was originally developed to be hand calculated. It results in a key that can be compared. Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison. Levenshtein distance looks at two values and produces a value based on their similarity. It's looking for missing or substituted letters. Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same
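The contrast can be made concrete with a short sketch. The code below assumes the common American Soundex rules (first letter kept, three digits, `h` and `w` ignored, vowels separating repeated codes) and the textbook dynamic-programming form of Levenshtein distance; the function names are illustrative and not taken from the thread.

```python
# Soundex digit for each coded consonant; other letters carry no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name):
    """Four-character phonetic key, e.g. a letter followed by three digits."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeat after it is kept
    return (key + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


print(soundex("Schmidt"), soundex("Smith"))  # → S530 S530
print(levenshtein("Schmidt", "Smith"))       # → 4
```

The two names collapse to the same phonetic key even though they are four edits apart, while a typo such as a single swapped letter tends to yield a small edit distance but may change the Soundex key.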

SQL Fuzzy Matching

I hope I am not repeating this question; I searched here and on Google before posting. I am running an eStore with SQL Server 2008 R2 with Full-Text enabled. My requirements: there is a Product table, which has the product name, OEM codes, and the models this product fits. All are text. I have created a new column called TextSearch. It holds the concatenated values of product name, OEM code and the models this product fits, comma-separated. When a customer enters a keyword, we run a search on the TextSearch column to match products. See matching logic below. I am using a

Fuzzy text (sentences/titles) matching in C#

Question: Hey, I'm using the Levenshtein algorithm to get the distance between a source and a target string. I also have a method which returns a value from 0 to 1: /// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two string are absolutely similar /// </summary> /// <param name="string1">The string1.</param> /// <param name="string2">The string2.</param> /// <returns></returns>

Searching names with Apache Solr

Question: I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names. After reading numerous posts and articles including: How can I use Lucene for personal name (first name, last name) search? http://dublincore.org/documents/1998/02/03/name-representation/ what's the best way to search a social network by prioritizing a users relationships first? http://www.gossamer-threads.com

Fuzzy matching of product names

Question: I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but not well enough

Algorithms for “fuzzy matching” strings

Question: By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit. Answer 1: I've finally understood what you were looking for. The issue is interesting; however, looking at the 2 algorithms you found, it seems that people have widely different opinions about the
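The matching style described in the question (all query characters present in order, with other characters allowed between them) can be sketched as an ordered-subsequence test plus a ranking heuristic. The version below is only one possible interpretation of "best fit": it prefers the tightest window containing the match, then shorter candidates. It is not the algorithm used by TextMate, Ido, or Icicles, and the function names are hypothetical.

```python
def match_gaps(query, candidate):
    """Number of extra characters inside the tightest in-order match.

    Returns None when the query's characters do not all occur, in order,
    in the candidate. Matching is case-insensitive.
    """
    q, c = query.lower(), candidate.lower()
    if not q:
        return 0
    best = None
    for start, ch in enumerate(c):
        if ch != q[0]:
            continue
        pos = start
        for wanted in q[1:]:
            pos = c.find(wanted, pos + 1)  # earliest next occurrence
            if pos == -1:
                break
        else:
            gaps = (pos - start + 1) - len(q)
            if best is None or gaps < best:
                best = gaps
    return best


def fuzzy_filter(query, candidates):
    """Keep candidates that match, tightest match first, then shortest."""
    scored = [(match_gaps(query, c), c) for c in candidates]
    matches = [(gaps, c) for gaps, c in scored if gaps is not None]
    return [c for gaps, c in sorted(matches, key=lambda m: (m[0], len(m[1])))]


print(fuzzy_filter("fb", ["foobar", "fb.txt", "bar"]))  # → ['fb.txt', 'foobar']
```

Real implementations often add further weighting, such as bonuses for matches at word boundaries or after path separators; those are left out here to keep the sketch small.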