ElasticSearch - Searching with hyphens

Posted by 感情迁移 on 2019-12-17 20:35:53

Question


Elasticsearch 1.6

I want to index text that contains hyphens, for example U-12, U-17, WU-12, t-shirt... and to be able to use a "Simple Query String" query to search on them.

Data sample (simplified):

{"title":"U-12 Soccer",
 "comment": "the t-shirts are dirty"}

There are already quite a few questions about hyphens, so I first tried the solution suggested in one of them:

Use a char filter (see "ElasticSearch - Searching with hyphens in name").

So I went for this mapping:

{
  "settings":{
    "analysis":{
      "char_filter":{
        "myHyphenRemoval":{
          "type":"mapping",
          "mappings":[
            "-=>"
          ]
        }
      },
      "analyzer":{
        "default":{
          "type":"custom",
          "char_filter":  [ "myHyphenRemoval" ],
          "tokenizer":"standard",
          "filter":[
            "standard",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings":{
    "test":{
      "properties":{
        "title":{
          "type":"string"
        },
        "comment":{
          "type":"string"
        }
      }
    }
  }
}

Searching is done with the following query:

{"_source":true,
  "query":{
    "simple_query_string":{
      "query":"<Text>",
      "default_operator":"AND"
    }
  }
}
  1. What works:

    "U-12", "U*", "t*", "ts*"

  2. What doesn't work:

    "U-*", "u-1*", "t-*", "t-sh*", ...

So it seems the char filter is not executed on search strings? What could I do to make this work?


Answer 1:


The answer is really simple.

Quoting Igor Motov (from the thread "Configuring the standard tokenizer"):

By default the simple_query_string query doesn't analyze the words with wildcards. As a result it searches for all tokens that start with i-ma. The word i-mac doesn't match this request because during analysis it's split into two tokens i and mac and neither of these tokens starts with i-ma. In order to make this query find i-mac you need to make it analyze wildcards:

{
  "_source":true,
  "query":{
    "simple_query_string":{
      "query":"u-1*",
      "analyze_wildcard":true,
      "default_operator":"AND"
    }
  }
}
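
To see why the unanalyzed wildcard fails against the mapping from the question, here is a minimal Python sketch. It is a toy model, not Elasticsearch's implementation: `analyze` and `prefix_matches` are hypothetical helpers, and the regular expression only roughly imitates the standard tokenizer. The point is that the index contains "u12" (hyphen removed), so the raw prefix "u-1" cannot match, while the analyzed prefix "u1" can.

```python
import re


def analyze(text):
    """Toy stand-in for the question's default analyzer (an assumption)."""
    # Char filter step: the "-=>" mapping removes every hyphen.
    without_hyphens = text.replace("-", "")
    # Tokenizer step: a rough approximation of the standard tokenizer.
    tokens = re.findall(r"[A-Za-z0-9]+", without_hyphens)
    # Token filter step: lowercase.
    return [token.lower() for token in tokens]


def prefix_matches(indexed_tokens, prefix):
    """Return True if any indexed token starts with the given prefix."""
    return any(token.startswith(prefix) for token in indexed_tokens)


indexed = analyze("U-12 Soccer")  # ['u12', 'soccer']

raw_prefix = "u-1"                               # wildcard text left unanalyzed
analyzed_prefix = "".join(analyze(raw_prefix))   # 'u1' after the char filter

print(prefix_matches(indexed, raw_prefix))       # False
print(prefix_matches(indexed, analyzed_prefix))  # True
```

This mirrors what "analyze_wildcard": true changes: the text before the * goes through the same char filter as the indexed data.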



Answer 2:


The quote from Igor Motov is correct: you have to add "analyze_wildcard":true to make wildcards work. But it is important to note that, without a hyphen-removing char filter, the standard tokenizer splits "u-12" at the hyphen into two separate tokens, "u" and "12".

If preserving the original text is important, do not use the mapping char filter. Otherwise it is quite useful.

Imagine that you have "m0-77", "m1-77" and "m2-77": if you search for m*-77 you are going to get zero hits. However, you can replace the hyphen with an AND operator to connect the two separate tokens. In simple_query_string syntax AND is written +, so searching for m* + 77 gives the correct hits. (The literal word AND is an operator only in the query_string query; simple_query_string treats it as an ordinary search term.)

You can do this rewriting on the client side.

For your u-1* case:

{
  "query":{
    "simple_query_string":{
      "query":"u + 1*",
      "analyze_wildcard":true
    }
  }
}

And for t-sh*:

{
  "query":{
    "simple_query_string":{
      "query":"t + sh*",
      "analyze_wildcard":true
    }
  }
}
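
The client-side rewrite could look roughly like the following Python sketch. The function names `hyphens_to_and` and `build_search_body` are hypothetical, and the sketch assumes the index was built without the hyphen-removing char filter, as this answer suggests. It uses +, which is how simple_query_string spells AND; pass a different operator if you target another query type.

```python
import re


def hyphens_to_and(query, and_operator="+"):
    """Rewrite hyphens that join two terms into an AND operator.

    Only a hyphen with a word character (or *) on both sides is rewritten,
    so a leading "-term" (negation in simple_query_string) is left alone.
    """
    return re.sub(
        r"(?<=[\w*])-(?=[\w*])",
        lambda _match: f" {and_operator} ",
        query,
    )


def build_search_body(user_query):
    """Build a simple_query_string request body from raw user input."""
    return {
        "query": {
            "simple_query_string": {
                "query": hyphens_to_and(user_query),
                "analyze_wildcard": True,
            }
        }
    }


print(hyphens_to_and("u-1*"))   # u + 1*
print(hyphens_to_and("t-sh*"))  # t + sh*
print(hyphens_to_and("m*-77"))  # m* + 77
```

The dictionary returned by `build_search_body` can be serialized with `json.dumps` and sent as the search request body.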



Answer 3:


If anyone is still looking for a simple workaround to this issue: replace the hyphen with an underscore (_) when indexing data.

For example, O-000022334 should be indexed as O_000022334.

Apply the same replacement to the search string, and replace the underscore with a hyphen again when displaying results. This way a user can search for "O-000022334" and it will find the correct match.
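
A minimal Python sketch of this workaround, assuming the conversion is done in application code before indexing and searching (the helper names `to_indexed_form` and `to_display_form` are hypothetical):

```python
def to_indexed_form(value):
    """Replace hyphens with underscores before indexing or searching."""
    return value.replace("-", "_")


def to_display_form(value):
    """Restore hyphens when showing a stored value to the user."""
    return value.replace("_", "-")


stored = to_indexed_form("O-000022334")
print(stored)                   # O_000022334
print(to_display_form(stored))  # O-000022334
```

Note that the round trip is only lossless if the original values never contain underscores of their own; otherwise those would also be shown as hyphens.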



Source: https://stackoverflow.com/questions/30917043/elasticsearch-searching-with-hyphens
