solr: is there a reason why SynonymGraphFilter must come before WordDelimiterGraphFilter?

Question

I have this in my schema.xml:

<analyzer type="query">
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;/.,)(]+"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

I put the SynonymGraphFilter after the WordDelimiterGraphFilter because I want to expand synonyms for tokens produced by splitting on hyphens. For example, the WordDelimiterGraphFilter will split "2-canal" into "2" and "canal", and then the SynonymGraphFilter will add "two" as a synonym of "2". However, when I query for 2-canal, I get

null:java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:701)
    at org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:636)
    at org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:581)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:343)
    ....

This does not happen if I put SynonymGraphFilter first.

Also, I noticed that in all analyzers pre-built in the managed-schema, SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?

Answer 1:

A possible cause is this bug: https://github.com/elastic/elasticsearch/issues/46272

"SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?"

The order of the filters matters, because each filter changes the tokens that the following filters receive.

Suppose you define a synonym such as "i-phone => expensive", while the WordDelimiterGraphFilter splits the user input "i-phone" into "i" and "phone".

If the synonym filter comes first, it matches "i-phone" and replaces it with the token "expensive"; the word delimiter filter then has nothing left to split, so the result is "expensive". If it is the other way around, the input is first split into "i" and "phone", and the synonym filter no longer matches, because its rule requires "i-phone".

Which order is "right" depends on how you define your synonyms: on delimited terms ("i", "phone") or on undelimited terms ("i-phone").
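As a rough sketch of the "synonyms first" ordering, here is the query analyzer from the question with only the two graph filters swapped; every attribute is copied from the question. It assumes that synonyms.txt contains rules keyed on the undelimited tokens the tokenizer emits (for example a rule whose left-hand side is the whole token "2-canal"), since a rule for "2" alone would not match that token. The Lucene Javadoc for SynonymGraphFilter also notes that it cannot consume an incoming token graph, which may be related to the exception in the question.

```xml
<!-- Sketch only: the query analyzer from the question with the two graph filters swapped. -->
<analyzer type="query">
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;/.,)(]+"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <!-- Synonyms first: this filter sees the undelimited token, e.g. "2-canal". -->
  <!-- Assumption: synonyms.txt has rules written against such undelimited tokens. -->
  <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <!-- Splitting second: whatever the synonym filter emits is what gets delimited. -->
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

The Analysis screen in the Solr admin UI shows the tokens after each filter, which is a quick way to check whether a given synonym rule still matches with this ordering.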

In your example, the synonym "2 => two" circumvents the bug mentioned above, which somehow seems to be related to very short terms. With the synonym filter first, your synonym in effect turns the search "2-canal" into "two-canal", which, unlike "2-canal", does not cause trouble in the word delimiter filter.

Source: https://stackoverflow.com/questions/64120979/solr-is-there-a-reason-why-synonymgraphfilter-must-come-before-worddelimitergra

Tags

solr