solr: is there a reason why SynonymGraphFilter must come before WordDelimiterGraphFilter?

Submitted by 守給你的承諾、 on 2020-12-15 06:23:09

Question


I have this in my schema.xml:

<analyzer type="query">
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;/.,)(]+"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

I put the SynonymGraphFilter after the WordDelimiterGraphFilter because I want to expand synonyms for tokens produced by splitting on hyphens. For example, the WordDelimiterGraphFilter will split "2-canal" into "2" and "canal", and then the SynonymGraphFilter will add "two" as a synonym of "2". However, when I query for 2-canal, I get

null:java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:701)
    at org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:636)
    at org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:581)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:343)
    ....

This does not happen if I put SynonymGraphFilter first.

Also, I noticed that in all analyzers pre-built in the managed-schema, SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?


Answer 1:


A possible cause is this bug: https://github.com/elastic/elasticsearch/issues/46272

You asked: "SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?"

The order of filters is important, because they will change the tokens to be further processed.

Suppose you define a synonym such as "i-phone->expensive", while WordDelimiterGraphFilter splits the user input "i-phone" into "i" and "phone".

If the synonym filter comes first, it matches "i-phone" and replaces the token with "expensive", so the delimiter filter has nothing left to split (result: "expensive"). If it is the other way around, the input is first split into "i" and "phone", and the synonym filter no longer matches, because its rule needs "i-phone".

Which order is "right" depends on how you want to define your synonyms: in delimited form ("phone", "i") or in undelimited form ("i-phone").
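
As a rough sketch of the first ordering (synonym filter ahead of the word delimiter filter), the query analyzer from the question could be rearranged as below. Nothing here is new except the filter order and the comments; every class name and attribute is copied from the question. The Lucene Javadoc for SynonymGraphFilter also notes that it cannot consume an incoming graph, which would be consistent with the pre-built analyzers placing it before WordDelimiterGraphFilter, though I have not verified that this is what triggers the exception.

```xml
<!-- Sketch only: same filters as in the question, with the synonym filter moved up. -->
<analyzer type="query">
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;/.,)(]+"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <!-- Synonyms are matched against the tokens as typed (for example "i-phone"), before any splitting. -->
  <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <!-- Splitting on hyphens and case changes happens only after synonym matching. -->
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

With this order, the entries in synonyms.txt have to be written against the form the tokenizer emits, because the synonym filter never sees the parts produced by the delimiter filter.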

In your example, the synonym "2->two" circumvents the bug I just mentioned, which somehow seems to be related to very short terms. Your synonym in effect alters the search "2-canal" to "two-canal", which, unlike "2-canal", does not cause trouble in the delimiter.



Source: https://stackoverflow.com/questions/64120979/solr-is-there-a-reason-why-synonymgraphfilter-must-come-before-worddelimitergra
