Working with hyphenated words in SOLR

Submitted by 左心房为你撑大大i on 2020-01-04 13:42:48

Question


I have a hyphenated word. In my case it is "re-use". I want to be able to match it for "re-use", "reuse" and "re use".

If I use a WordDelimiterFilterFactory with catenateAll=1 then it will transform "re-use" into "reuse". This doesn't cover the case of a search for "re use".

In addition to this, the word 're-use' is being used as a synonym with SynonymFilterFactory, so the solution would have to work with that too.

If my synonym file says "re-use => other thing", then I want to be able to match "other thing" when I type "re-use" or "reuse" or "re use". I have tried actually creating a synonym entry like "re use => re-use". This results in matching documents containing the non-hyphenated version, but doesn't then match "other thing" (I don't mind being extra-permissive about matching "re" or "use").

I could add a synonym for this word, but I'd like a general solution. Is there something obvious that I've missed?

EDIT:

I have 4 documents:

  • "thing"
  • "re use"
  • "re-use"
  • "reuse"

I want to search for any of these terms and return all the documents. The relevant bit of my schema:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

If my synonyms file looks like this, everything works as expected.

re use, reuse, thing

However, I want to represent that "re use" and "reuse" are synonyms. I also want to say that "reuse" and "thing", and lots of other things are synonyms. So I tried this:

re use, reuse
reuse, thing

This doesn't work. I think that lexk's answer suggested that it would?
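
One possible reason for the difference between the two synonym files can be sketched with a toy model. It assumes that, with expand="true", each comma-separated line maps every term to all terms on that line, that entries for the same term on different lines are merged, and that the mappings are not chained transitively. This is not Solr's code; `build_expansions` is a hypothetical helper that ignores tokenization, positions and stemming.

```python
from collections import defaultdict


def build_expansions(lines):
    """Toy model (hypothetical helper, not a Solr API).

    Maps each term to the union of every comma-separated group it
    appears in. Groups that share a term are merged for that term only;
    no transitive closure is computed.
    """
    expansions = defaultdict(set)
    for line in lines:
        group = {term.strip().lower() for term in line.split(",") if term.strip()}
        for term in group:
            expansions[term] |= group
    return dict(expansions)


one_line = build_expansions(["re use, reuse, thing"])
two_lines = build_expansions(["re use, reuse", "reuse, thing"])

print(sorted(one_line["thing"]))    # ['re use', 'reuse', 'thing']
print(sorted(two_lines["thing"]))   # ['reuse', 'thing']
print(sorted(two_lines["re use"]))  # ['re use', 'reuse']
print(sorted(two_lines["reuse"]))   # ['re use', 'reuse', 'thing']
```

Under these assumptions, only "reuse" picks up all three terms with the two-line file. A document containing just "thing" never receives "re use", and a document containing just "re use" never receives "thing", which would be consistent with the searches no longer returning every document.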


Answer 1:


It's enough to define a synonym rule for re-use if you are doing index-time expansion. Say you have re-use. You first transform it to reuse, then apply the SynonymFilter, so that re-use, reuse and 'other thing' all end up at the same index position. When you search for 'other thing', you get a match regardless of how many variations of re-use you created.
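
The index-time flow described above can be illustrated with a rough sketch. It is a simplified model and not Solr's implementation: `delimiter_variants` is a hypothetical stand-in for WordDelimiterFilterFactory with catenateAll="1" and preserveOriginal="1", `index_terms` is a hypothetical stand-in for lowercasing followed by a single synonym-expansion pass, and the synonym group `reuse, other thing` is assumed for illustration. Token positions and stemming are ignored, and a multi-word synonym is treated as a single term.

```python
import re


def delimiter_variants(token):
    """Toy stand-in for a word-delimiter step: keep the original token,
    its parts split on non-alphanumeric characters, and the parts
    joined back together (the catenated form)."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    variants = {token}
    if len(parts) > 1:
        variants.update(parts)
        variants.add("".join(parts))
    return variants


def index_terms(token, synonym_groups):
    """Return the set of terms that this toy model would index for one
    whitespace-separated token."""
    terms = {v.lower() for v in delimiter_variants(token)}
    # Single pass over the groups: any group sharing a term with the
    # token's variants is added in full (no transitive chaining).
    for group in synonym_groups:
        if terms & set(group):
            terms |= set(group)
    return terms


groups = [("reuse", "other thing")]  # assumed synonym group
terms = index_terms("re-use", groups)

print(sorted(terms))  # ['other thing', 're', 're-use', 'reuse', 'use']
```

In this model the document token "re-use" is indexed together with "reuse" and "other thing", so a query for "other thing" matches it without any synonym handling at query time.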



Source: https://stackoverflow.com/questions/17952126/working-with-hyphenated-words-in-solr
