How to modify standard analyzer to include #?

Submitted by 不羁的心 on 2019-12-08 08:29:59

Question


Some characters, such as #, are treated as delimiters, so they can never be matched in a query. What custom analyzer configuration, staying as close as possible to the standard analyzer, would allow these characters to be matched?


Answer 1:


1) The simplest way is to use the whitespace tokenizer with a lowercase filter:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

which gives you:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
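To make the behavior concrete, here is a minimal Python sketch of what the whitespace tokenizer plus lowercase filter do: split on whitespace only (so # survives inside a token), then lowercase each token. This is an illustration of the idea, not Lucene's actual implementation.

```python
def whitespace_lowercase_analyze(text):
    # Whitespace tokenizer: split on whitespace only, keeping
    # characters like '#' inside tokens.
    # Lowercase filter: lowercase every resulting token.
    return [token.lower() for token in text.split()]

print(whitespace_lowercase_analyze("new year #celebration vegas"))
# ['new', 'year', '#celebration', 'vegas']
```

Because the tokenizer never splits on #, the token `#celebration` is indexed intact and can be matched exactly.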

2) If you only want to preserve certain special characters, you can map them with a char filter so that the text is transformed before tokenization takes place. This stays closer to the standard analyzer. For example, you can create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}

Now, for curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas', the custom analyzer will generate the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}
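The pipeline above can be sketched in Python: the mapping char filter rewrites # to "hashtag " (the \u0020 space) before tokenization, then a simplified standard-style tokenizer splits on non-alphanumeric characters, and the lowercase filter normalizes case. This is a simplified stand-in for Lucene's StandardTokenizer, shown only to illustrate the order of operations.

```python
import re

def special_analyze(text):
    # Char filter: apply the mapping "#=>hashtag\u0020" before tokenization.
    filtered = text.replace("#", "hashtag ")
    # Simplified standard-style tokenizer: split on non-alphanumerics.
    tokens = re.findall(r"[A-Za-z0-9]+", filtered)
    # Lowercase filter.
    return [t.lower() for t in tokens]

print(special_analyze("new year #celebration vegas"))
# ['new', 'year', 'hashtag', 'celebration', 'vegas']
```

Note how `#celebration` becomes the two tokens `hashtag` and `celebration`, matching the _analyze output above.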

You can then search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}
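This works because the match query analyzes the query text with the same analyzer used at index time, so both sides produce the same tokens. A self-contained sketch (reusing the same simplified analyzer as an assumption, not Elasticsearch's real scoring) shows why both "#celebration" and plain "celebration" find the document:

```python
import re

def analyze(text):
    # Simplified stand-in for special_analyzer: char filter, then
    # split on non-alphanumerics, then lowercase.
    tokens = re.findall(r"[A-Za-z0-9]+", text.replace("#", "hashtag "))
    return [t.lower() for t in tokens]

# Tokens indexed for the document's "tweet" field.
indexed = set(analyze("new year #celebration vegas"))

# A match query succeeds if any analyzed query token is in the index.
print(any(t in indexed for t in analyze("#celebration")))  # True
print(any(t in indexed for t in analyze("celebration")))   # True
```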

You will also be able to search for just celebration, because the mapping appends a Unicode space (\u0020) after hashtag; without it, you would always have to include the # in your searches.

Hope this helps!



Source: https://stackoverflow.com/questions/34754057/how-to-modify-standard-analyzer-to-include
