How can I set up Solr to tokenize on whitespace and punctuation?

Submitted by 谁说胖子不能爱 on 2019-12-22 18:47:10

Question


I have been trying to get my Solr schema (using Solr 1.3.0) to create terms that are tokenized on whitespace and punctuation. Here are some examples of what I would like to see happen:

terms given -> terms tokenized

foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation

I thought that this combination would work:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>

The problem is that this results in the following for letter to number transitions:

one2three4 -> one,2,three,4

I have tried various combinations of WordDelimiterFilterFactory settings, but none have proven useful. Is there a filter or tokenizer that can handle what I require?


Answer 1:


How about:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" />

That should prevent one2three4 from being split.
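Putting it together, the full field type might look like the sketch below. Note that `splitOnNumerics` was not available in Solr 1.3.0 and was introduced in a later release, so this assumes you can upgrade to a version that supports it; the field type name `text` matches the one in the question.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace first... -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- ...then split on punctuation, but keep letter/digit runs together:
         foo-bar -> foo, bar
         one2three4 -> one2three4 -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

You can verify the resulting token stream for sample input in the Solr admin analysis page before reindexing.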



Source: https://stackoverflow.com/questions/3891054/how-can-i-set-up-solr-to-tokenize-on-whitespace-and-punctuation
