How to find best fuzzy match for a string in a large string database


I have a database of strings (of arbitrary length) which holds more than one million items (potentially more).

I need to compare a user-provided string against the whole database and retrieve the closest fuzzy matches.

7 Answers
  • 2020-12-08 11:23

    A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

  • 2020-12-08 11:24

    You didn't mention your database system, but for PostgreSQL you could use the following contrib module: trgm - Trigram matching for PostgreSQL

    The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
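    To illustrate the idea behind trigram matching, here is a simplified Python sketch of how a similarity score like pg_trgm's can be computed: each word is padded with blanks, split into overlapping three-character substrings, and two strings are scored by the Jaccard ratio of their trigram sets. (This is an approximation for illustration; pg_trgm's exact tokenization rules differ in detail.)

```python
def trigrams(s: str) -> set[str]:
    # Pad each word with two leading and one trailing blank,
    # mirroring pg_trgm's convention, then slide a 3-char window.
    out: set[str] = set()
    for word in s.lower().split():
        padded = f"  {word} "
        out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

def similarity(a: str, b: str) -> float:
    # Jaccard similarity of the two trigram sets: shared / total.
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

    The key point for a million-row table is that trigrams are indexable (GiST/GIN in PostgreSQL), so the database can prune candidates before scoring them.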

  • 2020-12-08 11:27

    Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.

    SOUNDEX is a hash based on the sound of the words, so spelling errors of the same word are likely to have the same SOUNDEX hash.

    Then find the SOUNDEX hash of the search string, and match on it.
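    For reference, here is a short Python sketch of the classic American Soundex algorithm: keep the first letter, map the remaining consonants to digit classes, skip vowels (and let h/w not break a run of equal codes), and pad the result to four characters.

```python
def soundex(word: str) -> str:
    # Consonant classes of American Soundex.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    if not word:
        return ""
    word = word.lower()
    result = word[0].upper()          # first letter is kept verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                  # h and w do not reset the previous code
        code = codes.get(ch, "")      # vowels map to "" and reset the run
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]       # pad/truncate to letter + 3 digits
```

    Since "Robert" and "Rupert" both hash to R163, an equality match on an indexed Soundex column finds such misspellings cheaply.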

  • 2020-12-08 11:33

    Since the amount of data is large, when inserting a record I would compute and store the value of the phonetic algorithm in an indexed column and then constrain (WHERE clause) my select queries within a range on that column.

  • 2020-12-08 11:44

    If your database supports it, you should use full-text search. Otherwise, you can use an indexer like Lucene and its various implementations.

  • 2020-12-08 11:48

    https://en.wikipedia.org/wiki/Levenshtein_distance

    The Levenshtein algorithm has been implemented in some DBMSs

    (e.g. PostgreSQL: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html)
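    For clarity, a minimal Python version of the standard dynamic-programming recurrence (two-row variant, O(len(a) * len(b)) time) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    # Edit distance: minimum number of single-character insertions,
    # deletions, and substitutions turning a into b.
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]
```

    Note that computing this against a million rows per query is expensive; in practice it is combined with an indexable pre-filter (trigrams, phonetic keys) to narrow the candidate set first.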
