How to define a field type for a field that contains both Chinese and English

不打扰是莪最后的温柔 submitted on 2019-12-11 23:13:40

Question


I am using Solr to index a field that contains both Chinese and English text. I also need to tokenize this field with NGramTokenizerFactory for searching.

Below is the field type I currently have defined for the field:

<fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I have to set minGramSize="1" so that a single Chinese character can be searched. However, this setting is unsuitable for searching English words.

For example, if I search for "see", it returns matches for "s", "se", "ee", "see", and "e".

Could anyone tell me the best way to index a field that contains both Chinese and English?


Answer 1:


I'm sure this isn't the answer you were hoping for, but it's the one that will actually solve the problem: don't use a single field to hold both Chinese and English.

Have one field for English and one field for Chinese, and index each piece of content into the field that matches its language. If you don't know the language at indexing time, you can use the Language Detection feature in an update processor to let Solr decide which field the content goes into.
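As a rough sketch of the update-processor approach, the solrconfig.xml fragment below detects the language of a field and renames it with a language suffix. The chain name `langid`, the source field `content`, and the language field `language_s` are assumptions for illustration, not names from the question. The suffix comes from whatever code the detector emits, so check the exact code reported for Chinese text in your setup and name the target fields to match. The language identification contrib jars must also be on Solr's classpath.

```xml
<!-- solrconfig.xml fragment; chain and field names are hypothetical -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Field whose text is inspected to detect the language -->
    <str name="langid.fl">content</str>
    <!-- Field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
    <!-- Map "content" to "content_<code>" using the detected code -->
    <bool name="langid.map">true</bool>
    <!-- Code to use when detection is inconclusive -->
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only takes effect when the update handler is configured to use it, for example by passing `update.chain=langid` on update requests.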

Searching is then done across both fields (depending on your query handler, possibly using qf), so the tokens of each language are processed separately against their own field and English words don't get n-grammed.

If you have both English and Chinese in the same document, preprocess the document to separate the Chinese and English parts (for example, iterate over each paragraph and detect its language) before indexing them into different fields.
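One possible shape for this setup is sketched below. The type names `text_en_words` and `text_zh_ngram`, the field names `content_en` and `content_zh`, and the gram sizes are assumptions chosen for illustration. The English field uses word tokenization, only the Chinese field is n-grammed, and an edismax handler searches both through `qf`.

```xml
<!-- schema.xml fragment; type and field names are hypothetical -->
<fieldType name="text_en_words" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split English text into whole words instead of n-grams -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<fieldType name="text_zh_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- minGramSize="1" keeps single Chinese characters searchable;
         the maxGramSize value is an assumption to tune -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
  </analyzer>
</fieldType>

<field name="content_en" type="text_en_words" indexed="true" stored="true"/>
<field name="content_zh" type="text_zh_ngram" indexed="true" stored="true"/>

<!-- solrconfig.xml fragment: search both fields by default -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">content_en content_zh</str>
  </lst>
</requestHandler>
```

With this layout a query for "see" is matched as a whole word in `content_en`, while a single Chinese character can still match in `content_zh`.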



Source: https://stackoverflow.com/questions/25347429/how-to-define-a-field-type-for-field-that-contains-both-chinese-and-english
