Search with various combinations of space, hyphen, casing and punctuation


We treated hyphenated words as a special case and wrote a custom analyzer that was used at index time to create three versions of the token, so in your case wal-mart would become walmart, wal mart and wal-mart. Each of these synonyms was written out using a custom SynonymFilter that was initially adapted from an example in the Lucene in Action book. The SynonymFilter sat between the Whitespace tokenizer and the LowerCase filter.
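As a minimal sketch of where such a filter would sit (the factory class name below is hypothetical, since the actual filter was custom Java code), the index-time chain could look roughly like this in schema.xml:

<fieldType name="text_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical custom filter: for a token like wal-mart it stacks
         walmart, wal mart and wal-mart at the same position -->
    <filter class="com.example.HyphenSynonymFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>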

At search time, any of the three versions would match one of the synonyms in the index.

Why does "WalMart" not match "Walmart" with my initial schema?

Because you have set the mm (minimum should match) parameter of your DisMax/eDisMax handler too high. I have played around with it: when you set mm to 100%, you get no match. But why?

Because you are using the same analyzer at query time as at index time, your search term "WalMart" is split into three tokens: "wal", "mart" and "walmart". Solr then counts each of these tokens individually towards the <str name="mm">100%</str> requirement*.

By the way, I have reproduced your problem, but for me it occurs when indexing Walmart and querying with WalMart. The other way around, it works fine.

You can override this with LocalParams by rephrasing your query like this: {!mm=1}WalMart.
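For example, against an eDisMax handler the full query parameter could look like the following (the field name "name" is only an assumption here):

q={!edismax qf=name mm=1}WalMart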

There are slightly more complex ones like [ ... ] "Mc Donald's" [ to match ] words with different punctuation: "Mc-Donald Engineering Company, Inc."

Here, too, playing with the mm parameter helps.

In general, what's the best way to go about modeling the schema with this kind of requirement?

Here I agree with Sujit Pal: you should implement your own copy of the SynonymFilter. Why? Because it works differently from the other filters and tokenizers. It creates tokens in place, at the same position (offset) as the indexed words.

What does "in place" mean? It will not increase the token count of your query. And you can also perform the joining in the other direction (combining two words that are separated by a blank).

But we are lacking a good synonyms.txt and cannot keep it up-to-date.

When extending or copying the SynonymFilter, ignore the static mapping. You may remove the code that maps the words; you just need the position (offset) handling.

Update: I think you can also try the PatternCaptureGroupTokenFilter, but tackling company names with regular expressions will soon hit its limits. I will have a look into this later.
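A rough sketch of how that filter could be dropped into the chain; the pattern below, which splits a hyphenated token into its parts while preserving the original token, is only an illustration and not a tested recipe for company names:

<filter class="solr.PatternCaptureGroupFilterFactory"
        pattern="([A-Za-z]+)-([A-Za-z]+)"
        preserve_original="true"/>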


* You can find this in your solrconfig.xml; have a look for your <requestHandler ... />.
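A typical place for it looks like the following sketch (the handler name and the other defaults here are assumptions, matching the <str name="mm">100%</str> quoted above):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name</str>
    <str name="mm">100%</str>
  </lst>
</requestHandler>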

I'll take the liberty of first making some adjustments to the analyzer. I'd consider WordDelimiterFilter to be functionally a second-step tokenization, so let's put it right after the tokenizer. After that, there is no need to maintain case, so lowercasing comes next. That's also better for your StopFilter, since we no longer need to worry about ignoreCase. Then add the stemmer.

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

All in all, this isn't too far off. The main problem is "Wal Mart" vs "Walmart". For these, WordDelimiterFilter has nothing to do with it; it's the tokenizer that does the splitting. "Wal Mart" gets split by the tokenizer, while "Walmart" never gets split, since nothing can reasonably know where it should be split up.

One solution for that would be to use KeywordTokenizer instead and let WordDelimiterFilter do all of the tokenizing, but that will lead to other problems (in particular, longer, more complex text like your "Mc-Donald Engineering Company, Inc." example will be problematic).

Instead, I'd recommend a ShingleFilter. This allows you to combine adjacent tokens into a single token to search on. This means, when indexing "Wal Mart", it will take the tokens "wal" and "mart" and also index the term "walmart". Normally, it would also insert a separator, but for this case, you'll want to override that behavior, and specify a separator of "".

We'll put the ShingleFilter at the end now (it'll tend to screw up stemming if you put it before the stemmer):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>

This will only create shingles of 2 consecutive tokens (as well as keeping the original single tokens), so I'm assuming you don't need to match more than that (if you needed "doremi" to match "Do Re Mi", for instance, you would have to increase maxShingleSize). But for the examples given, this works in my tests.
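In case it helps, here is a sketch of how the whole chain could be wrapped into a complete field type (the name text_company is just an assumption); with a single <analyzer> element it is applied at both index and query time:

<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>
  </analyzer>
</fieldType>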

Upgrading the Lucene version (4.4 to 4.10) in solrconfig.xml magically fixed the problem! I no longer have any limitations, and my query analyzer behaves as expected too.
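For reference, the version is controlled by the luceneMatchVersion element near the top of solrconfig.xml (the exact patch level shown below is just an example):

<luceneMatchVersion>4.10.4</luceneMatchVersion>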
