Search with various combinations of space, hyphen, casing and punctuation


We treated hyphenated words as a special case and wrote a custom analyzer that was used at index time to create three versions of the token, so in your case wal-mart would become walmart, wal mart and wal-mart. Each of these synonyms was written out using a custom SynonymFilter that was initially adapted from an example in the Lucene in Action book. The SynonymFilter sat between the Whitespace tokenizer and the LowerCase filter.
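As a minimal sketch of where such a filter would sit (the factory class name below is hypothetical, since the actual filter was custom Java code), the index-time chain could look roughly like this in schema.xml:

<fieldType name="text_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical custom filter: for a token like wal-mart it stacks
         walmart, wal mart and wal-mart at the same position -->
    <filter class="com.example.HyphenSynonymFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>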

At search time, any of the three versions would match one of the synonyms in the index.

Why does "WalMart" not match "Walmart" with my initial schema?

Because you have set the mm (minimum should match) parameter of your DisMax/eDisMax handler too high. I have played around with it: when you set mm to 100%, you get no match. But why?

Because you are using the same analyzer at query time as at index time, your search term "WalMart" is split into three tokens: "wal", "mart" and "walmart". Solr then counts each of these tokens individually towards the <str name="mm">100%</str> requirement*.

By the way, I have reproduced your problem, but for me it occurs when indexing Walmart and querying with WalMart. The other way around, it works fine.

You can override this with LocalParams by rephrasing your query like this: {!mm=1}WalMart.
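For example, against an eDisMax handler the full query parameter could look like the following (the field name "name" is only an assumption here):

q={!edismax qf=name mm=1}WalMart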

There are slightly more complex ones like [ ... ] "Mc Donald's" [ to match ] words with different punctuation: "Mc-Donald Engineering Company, Inc."

Here, too, playing with the mm parameter helps.

In general, what's the best way to go about modeling the schema with this kind of requirement?

Here I agree with Sujit Pal: you should implement your own copy of the SynonymFilter. Why? Because it works differently from the other filters and tokenizers. It creates tokens in place, at the same position (offset) as the indexed words.

What does "in place" mean? It will not increase the token count of your query. And you can also perform the joining in the other direction (combining two words that are separated by a blank).

But we are lacking a good synonyms.txt and cannot keep it up-to-date.

When extending or copying the SynonymFilter, ignore the static mapping. You may remove the code that maps the words; you just need the position (offset) handling.

Update: I think you can also try the PatternCaptureGroupTokenFilter, but tackling company names with regular expressions will soon hit its limits. I will have a look into this later.
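A rough sketch of how that filter could be dropped into the chain; the pattern below, which splits a hyphenated token into its parts while preserving the original token, is only an illustration and not a tested recipe for company names:

<filter class="solr.PatternCaptureGroupFilterFactory"
        pattern="([A-Za-z]+)-([A-Za-z]+)"
        preserve_original="true"/>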


* You can find this in your solrconfig.xml; have a look for your <requestHandler ... />.
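A typical place for it looks like the following sketch (the handler name and the other defaults here are assumptions, matching the <str name="mm">100%</str> quoted above):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name</str>
    <str name="mm">100%</str>
  </lst>
</requestHandler>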

I'll take the liberty of first making some adjustments to the analyzer. I'd consider WordDelimiterFilter to be functionally a second-step tokenization, so let's put it right after the tokenizer. After that, there is no need to maintain case, so lowercasing comes next. That's also better for your StopFilter, since we no longer need to worry about ignoreCase. Then add the stemmer.

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

All in all, this isn't too far off. The main problem is "Wal Mart" vs "Walmart". For these, WordDelimiterFilter has nothing to do with it; it's the tokenizer that does the splitting. "Wal Mart" gets split by the tokenizer, while "Walmart" never gets split, since nothing can reasonably know where it should be split up.

One solution for that would be to use KeywordTokenizer instead and let WordDelimiterFilter do all of the tokenizing, but that will lead to other problems (in particular, longer, more complex text like your "Mc-Donald Engineering Company, Inc." example will be problematic).

Instead, I'd recommend a ShingleFilter. This allows you to combine adjacent tokens into a single token to search on. This means, when indexing "Wal Mart", it will take the tokens "wal" and "mart" and also index the term "walmart". Normally, it would also insert a separator, but for this case, you'll want to override that behavior, and specify a separator of "".

We'll put the ShingleFilter at the end now (it'll tend to screw up stemming if you put it before the stemmer):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>

This will only create shingles of 2 consecutive tokens (as well as keeping the original single tokens), so I'm assuming you don't need to match more than that (if you needed "doremi" to match "Do Re Mi", for instance, you would have to increase maxShingleSize). But for the examples given, this works in my tests.
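In case it helps, here is a sketch of how the whole chain could be wrapped into a complete field type (the name text_company is just an assumption); with a single <analyzer> element it is applied at both index and query time:

<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>
  </analyzer>
</fieldType>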

Upgrading the Lucene version (4.4 to 4.10) in solrconfig.xml magically fixed the problem! I no longer have any limitations, and my query analyzer behaves as expected too.
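For reference, the version is controlled by the luceneMatchVersion element near the top of solrconfig.xml (the exact patch level shown below is just an example):

<luceneMatchVersion>4.10.4</luceneMatchVersion>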
