I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it...
Specifically, I'm trying to build multi-word (phrase) suggestions with Elasticsearch, and along the way I want to publish some detailed use cases.
Let's start with the edge_ngram tokenizer. By default it treats all the text as a single token, because by default a token can contain any characters (including spaces).
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "edge_ngram",
  "text": "How are you?"
}
Result:
["H", "Ho"]
Explanation: the whole text is one token, and the defaults are min_gram = 1 and max_gram = 2, so only the 1- and 2-character edge n-grams are produced.
PUT {ELASTICSEARCH_URL}/custom_edge_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "How ", "How o", "How ol", "How old"]
Explanation: the whole text is still a single token; with min_gram = 2 and max_gram = 7 we get the edge n-grams from 2 to 7 characters long.
PUT {ELASTICSEARCH_URL}/custom_edge_ngram_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7,
          "token_chars": ["letter"]
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram_2/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: now there are 4 tokens (How, old, are, you), because token_chars restricts tokens to letters. Edge n-grams of 2 to 7 characters are built for each token, but the longest token in the sentence is only 3 characters, so no n-gram is longer than that.
A tokenizer converts text into a stream of tokens.
A token filter works with each token of the stream and can modify the stream by adding, updating, or deleting tokens.
Let's use the standard tokenizer.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "text": "How old are you?"
}
Result:
["How", "old", "are", "you"]
Now let's add an edge_ngram token filter.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 7
    }
  ],
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: the edge_ngram filter is applied to each token separately.
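To tie this back to autocomplete: a common pattern is to apply the edge_ngram token filter only at index time and to search with a plain analyzer, so the user's input is matched against the stored prefixes but is not n-grammed itself. Here is a minimal sketch of how that could look; the index name autocomplete_demo, the field title, and the analyzer/filter names are made up for illustration, and the mapping assumes Elasticsearch 7+ (no mapping types).

PUT {ELASTICSEARCH_URL}/autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

PUT {ELASTICSEARCH_URL}/autocomplete_demo/_doc/1
{
  "title": "How old are you?"
}

GET {ELASTICSEARCH_URL}/autocomplete_demo/_search
{
  "query": {
    "match": {
      "title": {
        "query": "how ol",
        "operator": "and"
      }
    }
  }
}

Because search_analyzer is standard, the query "how ol" is tokenized into how and ol, both of which exist among the edge n-grams produced at index time, so the document is returned; "operator": "and" ensures every typed prefix has to match.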