I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it...
Specifically, I'm trying to build multi-word (phrase) suggestions with Elasticsearch, and along the way I want to publish some detailed use cases.
Let's start with the edge_ngram tokenizer. By default it treats all the text as a single token, because by default a token can contain any characters (including spaces).
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "edge_ngram",
  "text": "How are you?"
}
Result:
["H", "Ho"]
Explanation: the whole text is one token, and the defaults are min_gram = 1 and max_gram = 2, so only the 1- and 2-character edge n-grams are produced.
PUT {ELASTICSEARCH_URL}/custom_edge_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "How ", "How o", "How ol", "How old"]
Explanation: the whole text is still a single token; with min_gram = 2 and max_gram = 7 we get the edge n-grams from 2 to 7 characters long.
PUT {ELASTICSEARCH_URL}/custom_edge_ngram_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7,
          "token_chars": ["letter"]
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram_2/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: now there are 4 tokens (How, old, are, you), because token_chars restricts tokens to letters. Edge n-grams of 2 to 7 characters are built for each token, but the longest token in the sentence is only 3 characters, so no n-gram is longer than that.
A tokenizer converts text into a stream of tokens.
A token filter works with each token of the stream and can modify the stream by adding, updating, or deleting tokens.
Let's use the standard tokenizer.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "text": "How old are you?"
}
Result:
["How", "old", "are", "you"]
Now let's add an edge_ngram token filter.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 7
    }
  ],
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: the edge_ngram filter is applied to each token separately.
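To tie this back to autocomplete: a common pattern is to apply the edge_ngram token filter only at index time and to search with a plain analyzer, so the user's input is matched against the stored prefixes but is not n-grammed itself. Here is a minimal sketch of how that could look; the index name autocomplete_demo, the field title, and the analyzer/filter names are made up for illustration, and the mapping assumes Elasticsearch 7+ (no mapping types).

PUT {ELASTICSEARCH_URL}/autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

PUT {ELASTICSEARCH_URL}/autocomplete_demo/_doc/1
{
  "title": "How old are you?"
}

GET {ELASTICSEARCH_URL}/autocomplete_demo/_search
{
  "query": {
    "match": {
      "title": {
        "query": "how ol",
        "operator": "and"
      }
    }
  }
}

Because search_analyzer is standard, the query "how ol" is tokenized into how and ol, both of which exist among the edge n-grams produced at index time, so the document is returned; "operator": "and" ensures every typed prefix has to match.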