How to modify standard analyzer to include #?

Submitted by 不羁的心 on 2019-12-08 08:29:59

Question


Some characters, such as #, are treated as delimiters, so they can never be matched in a query. What custom analyzer configuration, staying as close as possible to the standard analyzer, would allow these characters to be matched?


Answer 1:


1) The simplest way is to use the whitespace tokenizer with a lowercase filter:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

which gives you:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
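To make the behavior concrete, here is a minimal Python sketch of what the whitespace tokenizer plus lowercase filter do: split on whitespace only (so # survives inside a token), then lowercase each token. This is an illustration of the idea, not Lucene's actual implementation.

```python
def whitespace_lowercase_analyze(text):
    # Whitespace tokenizer: split on whitespace only, keeping
    # characters like '#' inside tokens.
    # Lowercase filter: lowercase every resulting token.
    return [token.lower() for token in text.split()]

print(whitespace_lowercase_analyze("new year #celebration vegas"))
# ['new', 'year', '#celebration', 'vegas']
```

Because the tokenizer never splits on #, the token `#celebration` is indexed intact and can be matched exactly.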

2) If you only want to preserve certain special characters, you can map them with a char filter so that the text is transformed before tokenization takes place. This stays closer to the standard analyzer. For example, you can create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}

Now, for curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas', the custom analyzer will generate the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}
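The pipeline above can be sketched in Python: the mapping char filter rewrites # to "hashtag " (the \u0020 space) before tokenization, then a simplified standard-style tokenizer splits on non-alphanumeric characters, and the lowercase filter normalizes case. This is a simplified stand-in for Lucene's StandardTokenizer, shown only to illustrate the order of operations.

```python
import re

def special_analyze(text):
    # Char filter: apply the mapping "#=>hashtag\u0020" before tokenization.
    filtered = text.replace("#", "hashtag ")
    # Simplified standard-style tokenizer: split on non-alphanumerics.
    tokens = re.findall(r"[A-Za-z0-9]+", filtered)
    # Lowercase filter.
    return [t.lower() for t in tokens]

print(special_analyze("new year #celebration vegas"))
# ['new', 'year', 'hashtag', 'celebration', 'vegas']
```

Note how `#celebration` becomes the two tokens `hashtag` and `celebration`, matching the _analyze output above.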

You can then search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}
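This works because the match query analyzes the query text with the same analyzer used at index time, so both sides produce the same tokens. A self-contained sketch (reusing the same simplified analyzer as an assumption, not Elasticsearch's real scoring) shows why both "#celebration" and plain "celebration" find the document:

```python
import re

def analyze(text):
    # Simplified stand-in for special_analyzer: char filter, then
    # split on non-alphanumerics, then lowercase.
    tokens = re.findall(r"[A-Za-z0-9]+", text.replace("#", "hashtag "))
    return [t.lower() for t in tokens]

# Tokens indexed for the document's "tweet" field.
indexed = set(analyze("new year #celebration vegas"))

# A match query succeeds if any analyzed query token is in the index.
print(any(t in indexed for t in analyze("#celebration")))  # True
print(any(t in indexed for t in analyze("celebration")))   # True
```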

You will also be able to search for just celebration, because the mapping appends a Unicode space (\u0020) after hashtag; without it, you would always have to include the # in your searches.

Hope this helps!



Source: https://stackoverflow.com/questions/34754057/how-to-modify-standard-analyzer-to-include
