Search with various combinations of space, hyphen, casing and punctuation

迷失自我 · 2021-01-04 08:00

My schema:


  
          


        
4 Answers
  •  予麋鹿
     2021-01-04 08:36

    I'll take the liberty of first making some adjustments to the analyzer. I'd consider WordDelimiterFilter to be functionally a second step of tokenization, so let's put it right after the tokenizer. After that there is no need to preserve case, so lowercasing comes next. That's better for your StopFilter, since we no longer need to worry about ignoreCase. Then add the stemmer.

    
    
    
    
    
    
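    The configuration block referenced here was lost in extraction. A minimal sketch of the reordered chain just described, assuming StandardTokenizer, a stopword file named stopwords.txt, a Porter stemmer, and typical WordDelimiterFilter attributes (all placeholders for whatever the original schema used):

    ```xml
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- second-step tokenization: split on hyphens, case changes, etc. -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
        <!-- lowercase before StopFilter, so ignoreCase is no longer needed -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    ```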

    All in all, this isn't too far off. The main problem is "Wal Mart" vs. "Walmart". In both cases WordDelimiterFilter has nothing to do with it; it's the tokenizer doing the splitting. "Wal Mart" gets split by the tokenizer on the whitespace. "Walmart" never gets split, since nothing can reasonably know where it should be split up.

    One solution would be to use KeywordTokenizer instead and let WordDelimiterFilter do all of the tokenizing, but that leads to other problems; in particular, longer, more complex text like your "Mc-Donald Engineering Company, Inc." example would be problematic.
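    For illustration, that rejected alternative would look something like this (a sketch, not a recommendation; the WordDelimiterFilter attributes are assumptions):

    ```xml
    <analyzer>
      <!-- the whole input becomes a single token... -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- ...and WordDelimiterFilter does all of the splitting -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    ```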

    Instead, I'd recommend a ShingleFilter. This allows you to combine adjacent tokens into a single token to search on. This means that, when indexing "Wal Mart", it will take the tokens "wal" and "mart" and also index the term "walmart". Normally it would also insert a separator between them, but for this case you'll want to override that behavior and specify a separator of "".

    We'll put the ShingleFilter at the end now (it'll tend to screw up stemming if you put it before the stemmer):

    
    
    
    
    
    
    
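    The configuration shown here was also lost in extraction. A sketch of the final chain, with the ShingleFilter parameters inferred from the description (shingles of 2 tokens, original unigrams kept, empty separator) and the rest of the chain assumed as before:

    ```xml
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <!-- shingles last, so stemming isn't disturbed;
             "wal" + "mart" also indexes the term "walmart" -->
        <filter class="solr.ShingleFilterFactory"
                maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
      </analyzer>
    </fieldType>
    ```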

    This will only create shingles of 2 consecutive tokens (as well as keeping the original single tokens), so I'm assuming you don't need to match more than that (if you needed "doremi" to match "Do Re Mi", for instance). But for the examples given, this works in my tests.
