Lucene Fuzzy Search for customer names and partial address

Posted by 时光毁灭记忆、已成空白 on 2019-12-04 06:09:47

Rushik, here are a few ideas:

  • Consider using Solr. It is much easier to get started with than bare Lucene.
  • Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
  • Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
  • Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
  • I assume your customer list is smaller than the file. One possible policy is to index the changes in the file daily (a unique id is really handy here; otherwise you need to delete by query, which may miss the mark). You can then optimize the index and, after that, run a search for your updated customer list.
  • What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or with the query parser.
  • Consider using soundex for names as described here.
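
To make the matching logic in the list above concrete, here is a minimal Python sketch of the idea, not Lucene code. It assumes plain Levenshtein distance as a stand-in for what a fuzzy query does, American Soundex for the phonetic fallback, and an exact comparison for the country code (the "keyword" field). The function names, the `max_edits` parameter, and the sample records are all hypothetical and exist only for illustration; Lucene's own fuzzy matching and any Solr phonetic filter will differ in detail.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    chars = [c for c in name.upper() if c.isalpha()]
    if not chars:
        return ""
    result = chars[0]
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def name_matches(query: str, candidate: str, max_edits: int = 2) -> bool:
    """Fuzzy clause: close in edit distance, or same Soundex code."""
    q, c = query.lower(), candidate.lower()
    return edit_distance(q, c) <= max_edits or soundex(q) == soundex(c)


def find_customer(customer: dict, records: list, max_edits: int = 2) -> list:
    """All clauses are required: exact country, fuzzy first and last name."""
    return [r for r in records
            if r["country"] == customer["country"]
            and name_matches(customer["first"], r["first"], max_edits)
            and name_matches(customer["last"], r["last"], max_edits)]


# Hypothetical sample data, one record per customer.
records = [
    {"id": 1, "first": "Jon", "last": "Smyth", "country": "US"},
    {"id": 2, "first": "John", "last": "Smith", "country": "GB"},
    {"id": 3, "first": "Joan", "last": "Schmidt", "country": "US"},
    {"id": 4, "first": "Mary", "last": "Jones", "country": "US"},
]
customer = {"first": "John", "last": "Smith", "country": "US"}

matches = find_customer(customer, records)
print([r["id"] for r in matches])  # prints [1, 3]
```

Record 3 matches only because "Schmidt" and "Smith" share the Soundex code S530, which shows how a phonetic clause widens recall at the cost of more false positives. Whether that trade-off is acceptable depends on how the results are reviewed afterwards.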

Some academic papers on this subject are well worth reading (google for the free PDFs):

  • A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
  • Overview of Record Linkage and Current Research Directions (2006)
  • A Parallel Open Source Data Linkage System (2004)

You should also consider the following libraries/frameworks:

(Answered for future visitors.)
