Does Lucene.Net support phrases? What is the best approach to tokenizing comma-delimited data (atomically) in fields during indexing?

Submitted by 99封情书 on 2019-12-11 04:54:29

Question


I have a database with a column I wish to index that has comma-delimited names, e.g.,

User.FullNameList = "Helen Ready, Phil Collins, Brad Paisley"

I prefer to tokenize each name atomically (name as a whole searchable entity). What is the best approach for this?

  1. Did I miss a simple option to set the tokenize delimiter?
  2. Do I have to subclass an existing tokenizer or write my own class?
  3. Something else? ;)

Or does Lucene.net not support phrases?

Or is it smart enough to handle this use case automatically?

I'm sure I'm not the first person to have to do this. Googling produced no noticeable solutions.

*** EDIT: using my example, I want to store these name phrases in a single field:

Helen Ready

Phil Collins

Brad Paisley

NOT these individual words:

Helen

Ready

Phil

Collins

Brad

Paisley


Answer 1:


Edit: Having read your clarification, here is hopefully a more relevant answer:

  1. You did not miss an option to modify the separator character.
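  2. You do need to roll your own tokenizer. I suggest you subclass CharTokenizer and define isTokenChar() (IsTokenChar() in Lucene.Net) so that every character except a comma is a token character. Note that each token will then still contain the space that follows a comma, so trim it if the names should match exactly.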
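
To make the isTokenChar() rule concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class name CommaTokenSketch and the helpers tokenize() and flush() are hypothetical, and the trimming step is an assumption about what you want for names written with a space after each comma. It only illustrates which tokens a comma-only delimiter rule would produce.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaTokenSketch {

    // Same rule as suggested for isTokenChar(): everything except a comma
    // is part of a token.
    static boolean isTokenChar(char c) {
        return c != ',';
    }

    // Hypothetical helper: scans the input once and emits one token per
    // comma-separated segment.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens); // emit the segment after the last comma
        return tokens;
    }

    // Trims surrounding whitespace (an assumption, see above) and skips
    // empty segments such as the one produced by a trailing comma.
    private static void flush(StringBuilder current, List<String> tokens) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // prints [Helen Ready, Phil Collins, Brad Paisley]
        System.out.println(tokenize("Helen Ready, Phil Collins, Brad Paisley"));
    }
}
```

In a real CharTokenizer subclass only the isTokenChar() rule would be yours to write; the scanning loop is provided by the base class, and trimming would have to happen in a later filter or before indexing.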



Answer 2:


You can split the string on commas yourself, and then either:

  • Index each name with the Keyword analyzer (non-tokenized), so that each full name becomes a single term
  • OR index each name with the standard analyzer and wrap your searches in quotes. Make sure to index a dummy term between consecutive names so that a phrase search such as "Ready Phil" doesn't match the document
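
The second option can be illustrated without Lucene. The sketch below is plain Java and only simulates the idea: indexTerms() builds the word sequence a tokenizing analyzer might produce, with a dummy term between names, and matchesPhrase() stands in for a quoted phrase search by looking for the words in consecutive positions. The class name, both method names, and the separator value "zzsepzz" are hypothetical; lowercasing is an assumption about the analyzer's behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseBoundarySketch {

    // Hypothetical dummy term; pick something that never occurs in real names.
    static final String SEPARATOR = "zzsepzz";

    // Splits the field on commas, then on whitespace, and inserts the dummy
    // term between consecutive names.
    static List<String> indexTerms(String fullNameList) {
        List<String> terms = new ArrayList<>();
        for (String name : fullNameList.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (!terms.isEmpty()) {
                terms.add(SEPARATOR);
            }
            terms.addAll(Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+")));
        }
        return terms;
    }

    // Stand-in for a quoted phrase search: true only if the phrase's words
    // appear in the same order at consecutive positions.
    static boolean matchesPhrase(List<String> terms, String phrase) {
        List<String> wanted = Arrays.asList(phrase.trim().toLowerCase(Locale.ROOT).split("\\s+"));
        return Collections.indexOfSubList(terms, wanted) >= 0;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("Helen Ready, Phil Collins, Brad Paisley");
        System.out.println(matchesPhrase(terms, "Phil Collins")); // prints true
        System.out.println(matchesPhrase(terms, "Ready Phil"));   // prints false
    }
}
```

Without the separator, "ready" and "phil" would sit in adjacent positions and the second search would match, which is exactly the false positive the dummy term is meant to prevent.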


Source: https://stackoverflow.com/questions/2447139/lucene-net-support-phrases-what-is-best-approach-to-tokenize-comma-delimited-d
