Question
I have this in my schema.xml:
<analyzer type="query">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n/.,)(]+"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
I put the SynonymGraphFilter after the WordDelimiterGraphFilter because I want to expand synonyms for tokens produced by splitting on hyphens. For example, the WordDelimiterGraphFilter will split "2-canal" into "2" and "canal", and then the SynonymGraphFilter will add "two" as a synonym of "2". However, when I query for 2-canal, I get
null:java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:701)
at org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:636)
at org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:581)
at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:343)
....
This does not happen if I put SynonymGraphFilter first.
Also, I noticed that in all analyzers pre-built in the managed-schema, SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?
Answer 1:
A possible cause is this bug: https://github.com/elastic/elasticsearch/issues/46272
"SynonymGraphFilter always comes before WordDelimiterGraphFilter. Is there a reason for this and if so, how else can I achieve the same effect?"
The order of filters matters, because each filter changes the tokens that the filters after it will process.
You might define a synonym such as "i-phone->expensive", while WordDelimiter changes the user input "i-phone" to "i" and "phone".
If the synonym filter comes first, it matches "i-phone" and the token becomes "expensive", so the delimiter has nothing left to split (result: "expensive"). If it is the other way around, the delimiter first splits the input into "i" and "phone", and the synonym filter no longer matches, because its rule needs "i-phone".
Which order is "right" depends on how you want to define your synonyms: delimited ("phone", "i") or undelimited ("i-phone").
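As a rough sketch of the synonym-first order, the query analyzer from the question could be rearranged as shown below. The filter classes and attributes are copied from the question. The tokenizer pattern is written with escape sequences as an assumption about the original, and the synonyms.txt entry mentioned in the comment is purely illustrative, not something taken from the asker's files.

```xml
<!-- Sketch only: the question's filters, reordered so that synonyms are
     matched before WordDelimiterGraphFilter splits the token.
     With this order, synonyms.txt has to be written against the
     undelimited form, e.g. a hypothetical line:
       i-phone => expensive
     A rule keyed on a single part such as "phone" would not match
     the whole token "i-phone" at this stage. -->
<analyzer type="query">
  <!-- Pattern is an assumed reconstruction; adjust it to your schema. -->
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n/.,)(]+"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <!-- Synonyms first: sees tokens exactly as the tokenizer produced them. -->
  <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <!-- Delimiter second: splits whatever the synonym filter emitted. -->
  <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

The Analysis screen in the Solr admin UI can be used to check, for an input such as "2-canal", which tokens each stage actually emits with this order.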
In your example, the synonym "2->two" circumvents the bug I just mentioned, because the bug seems to be related to very short terms. Your synonym in effect changes the search "2-canal" to "two-canal", which does not cause trouble in the delimiter (unlike "2-canal").
Source: https://stackoverflow.com/questions/64120979/solr-is-there-a-reason-why-synonymgraphfilter-must-come-before-worddelimitergra