What is the best practice for fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8

一曲冷凌霜 submitted on 2021-02-11 13:41:32

Question


Background: I use MySQL with millions of rows, each having about twenty columns. We run some complex searches, and some columns need fuzzy matching, such as username LIKE '%aaa%'. MySQL can't use an index for this unless the leading % is removed, but we need fuzzy matching to provide search like Stack Overflow's. I also checked the MySQL full-text index, but it doesn't support complex searches within one SQL statement when combined with other indexes.

My solution: add Elasticsearch as our search engine, write data to both MySQL and Elasticsearch, and search only in Elasticsearch.

I checked Elasticsearch's fuzzy-search options. The wildcard query works, but many people advise against a leading *, since it makes the search very slow.

For example: username: 'John_Snow'

wildcard works but may be very slow:

GET /user/_search
{
  "query": {
    "wildcard": {
      "username": "*hn*"
    }
  }
}

match_phrase doesn't work; it seems to only match whole tokens, such as the phrase 'John Snow':

{
  "query": {
    "match_phrase": {
      "dbName": "hn"
    }
  }
}

My question: Is there any better solution for complex queries that contain fuzzy matches like '%no%' or '%hn_Sn%'?


Answer 1:


You can use the ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, then emits N-grams of each word of the specified length.

Adding a working example with index data, mapping, search query, and results.

Index Mapping:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        },
        "max_ngram_diff": 50
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "standard"
            }
        }
    }
}
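
Note that the mapping above uses the typeless (7.x-style) syntax. Elasticsearch 6.8 still expects a mapping type by default, so a 6.8-compatible version of the same index creation request (a sketch, assuming the index is named test to match the search result below, with _doc as the type name) would be:

PUT /test
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        },
        "max_ngram_diff": 50
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }
    }
}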

Analyze API

POST /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "John_Snow"
}

The tokens are:

{
    "tokens": [
        {
            "token": "Jo",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "Joh",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "John",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "oh",
            "start_offset": 1,
            "end_offset": 3,
            "type": "word",
            "position": 3
        },
        {
            "token": "ohn",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 4
        },
        {
            "token": "hn",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 5
        },
        {
            "token": "Sn",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 6
        },
        {
            "token": "Sno",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 7
        },
        {
            "token": "Snow",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 8
        },
        {
            "token": "no",
            "start_offset": 6,
            "end_offset": 8,
            "type": "word",
            "position": 9
        },
        {
            "token": "now",
            "start_offset": 6,
            "end_offset": 9,
            "type": "word",
            "position": 10
        },
        {
            "token": "ow",
            "start_offset": 7,
            "end_offset": 9,
            "type": "word",
            "position": 11
        }
    ]
}
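
One caveat: my_analyzer has no token filters, so the emitted ngrams keep their original case (Jo, Sn, ...), while the standard search analyzer lowercases the query text; a search for sn would therefore miss the indexed Sn token. If case-insensitive matching is needed, one option is to add the built-in lowercase filter to the analyzer definition in the settings above:

"analyzer": {
    "my_analyzer": {
        "tokenizer": "my_tokenizer",
        "filter": [
            "lowercase"
        ]
    }
}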

Index Data:

{
  "title":"John_Snow"
}
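
For completeness, the document can be indexed like this (assuming the index name test, the type _doc, and the id 1 seen in the search result below):

PUT /test/_doc/1
{
  "title": "John_Snow"
}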

Search Query:

GET /test/_search
{
    "query": {
        "match": {
            "title": "hn"
        }
    }
}

Search Result:

"hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "title": "John_Snow"
                }
            }
        ]

Refer to this blog if you want to do an autocomplete search.

Another search query:

GET /test/_search
{
    "query": {
        "match": {
            "title": "ohr"
        }
    }
}

The above search query returns no results, because "ohr" is not a substring of "John_Snow", so none of the generated ngram tokens match it.



Source: https://stackoverflow.com/questions/63912422/what-is-the-best-practice-of-fuzzy-search-like-aaa-in-mysql-in-elasticsear
