How to wisely combine shingles and edgeNgram to provide flexible full text search?

Submitted by ♀尐吖头ヾ on 2019-11-26 17:46:18

This is an interesting use case. Here's my take:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 25
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
  • my_ngram_analyzer splits every text into small pieces; how large the pieces are depends on your use case. For testing purposes I chose a maximum of 25 characters. The lowercase filter is there because you asked for case-insensitive matching. Basically, this is the analyzer used for substringof('table 1',name). The query is simple:
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}
  • my_edge_ngram_analyzer splits the text into pieces that all start at the beginning of the text, which is exactly what the startswith(name,'table 1') use case needs. Again, the query is simple:
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}
  • I found this the trickiest part: the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams where the edge is the end of the text, not the start (as with the regular edgeNGram). The query:
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
  • For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}
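
To make the four cases above more concrete, here is a rough pure-Python sketch of the terms each analyzer would put in the index, and why an un-analyzed term lookup then behaves like substringof, startswith, endswith and eq. It is only a simulation under the settings shown (min_gram 2, max_gram 25, whitespace kept inside the grams); the function names are made up for illustration and none of this is Elasticsearch code.

```python
# Illustrative simulation only: approximates the terms each analyzer indexes.
MIN_GRAM, MAX_GRAM = 2, 25  # same limits as in the settings above


def ngrams(text):
    """my_ngram_analyzer: every lowercase substring of 2..25 characters."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(MIN_GRAM, MAX_GRAM + 1)
        for i in range(len(text) - n + 1)
    }


def edge_ngrams(text):
    """my_edge_ngram_analyzer: every lowercase prefix of 2..25 characters."""
    text = text.lower()
    return {text[:n] for n in range(MIN_GRAM, min(len(text), MAX_GRAM) + 1)}


def reverse_edge_ngrams(text):
    """my_reverse_edge_ngram_analyzer: reverse, take prefixes, reverse back."""
    return {gram[::-1] for gram in edge_ngrams(text[::-1])}


def lowercase_keyword(text):
    """lowercase_keyword: the whole text as one lowercase term."""
    return {text.lower()}


def term_matches(indexed_terms, value):
    """A term query does not analyze its input: it needs an exact indexed term."""
    return value in indexed_terms


if __name__ == "__main__":
    name = "Big Table 1 Report"
    print(term_matches(ngrams(name), "table 1"))               # substringof
    print(term_matches(edge_ngrams(name), "table 1"))          # startswith
    print(term_matches(reverse_edge_ngrams(name), "table 1"))  # endswith
    print(term_matches(lowercase_keyword(name), "table 1"))    # eq
```

Two consequences follow from this sketch: the search value has to be lowercased by the caller, because term does not do it, and a value shorter than 2 or longer than 25 characters cannot match any of the n-gram based fields.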

Regarding query_string, this changes the solution a bit, because I was counting on term to not analyze the input text and to match it exactly with one of the terms in the index.

But this can be "simulated" with query_string if the appropriate analyzer is specified for it.

The solution would be a set of queries like the following (always use that analyzer, changing only the field name):

{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}
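
As a sketch of "changing only the field name", the request bodies could be generated by a small helper like the one below. The operator-to-field table and the function name build_query are hypothetical, introduced only for illustration; the field names and the lowercase_keyword analyzer come from the mapping above. The escaping of backslashes and double quotes is an assumption about what a quoted query_string phrase needs, so check it against your Elasticsearch version.

```python
import json

# Hypothetical lookup from the filter operator to the sub-field defined in the mapping.
FIELDS = {
    "substringof": "text",
    "startswith": "text.starts_with",
    "endswith": "text.ends_with",
    "eq": "text.exact_case_insensitive_match",
}


def build_query(operator, value):
    """Build a query_string body that always uses the lowercase_keyword analyzer."""
    field = FIELDS[operator]
    # Assumption: escape backslashes and double quotes inside the quoted phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {
            "query_string": {
                "query": f'{field}:("{escaped}")',
                "analyzer": "lowercase_keyword",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("endswith", "table 1"), indent=2))
```

Because lowercase_keyword lowercases the input and keeps it as a single term, the caller does not have to lowercase the value first, unlike with the term queries.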