ElasticSearch Analyzer and Tokenizer for Emails

Anonymous (unverified), submitted on 2019-12-03 01:58:03

Question:

I could not find a perfect solution for the following situation, either on Google or in the ES resources, so I hope someone can help here.

Suppose there are five documents with email addresses stored under the field "email":

1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}

I want to fulfill the following searching scenarios:

[Search -> Receive]

"john.doe@gmail.com" -> 1,2

"john.doe@outlook.com" -> 2,4

"john@yahoo.com" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters, and also experimented with the conditions of match queries, but did not find an ideal solution. Any thoughts are welcome, and there is no restriction on the mappings, analyzers, or kind of query to use. Thanks.

Answer 1:

Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}

Test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}

Query to be used:

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "john.doe@gmail.com"
    }
  }
}
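
To see why a plain term query is enough with this analyzer, it may help to look at which tokens end up in the index. The following is only a rough Python simulation, not Elasticsearch itself. It assumes that uax_url_email keeps every address as a single token (approximated here by splitting on commas and whitespace), and it substitutes [^\W\d_] for \p{L}, which Python's re module does not support. The names tokenize, analyze, and search are made up for this sketch.

```python
import re

# Patterns copied from the "email" pattern_capture filter in the mapping.
# Assumption: [^\W\d_] stands in for \p{L} (letters), which Python's re lacks.
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
    r"([^-@]+)",
]

DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def tokenize(text):
    # Crude stand-in for uax_url_email: assumes each address stays one token.
    return [t for t in re.split(r"[,\s]+", text) if t]


def analyze(text):
    # Mimics pattern_capture (with preserve_original), lowercase, and unique.
    tokens = set()
    for token in tokenize(text):
        tokens.add(token.lower())  # the preserved original token
        for pattern in PATTERNS:
            for match in re.finditer(pattern, token):
                tokens.add(match.group(1).lower())
    return tokens


def search(term):
    # A term query is not analyzed: it matches only an exact indexed token.
    return sorted(doc_id for doc_id, text in DOCS.items() if term in analyze(text))


if __name__ == "__main__":
    for term in ["john.doe@gmail.com", "john.doe@outlook.com", "john@yahoo.com",
                 "john.doe", "john", "gmail.com", "outlook.com"]:
        print(term, "->", search(term))
```

Under these assumptions, every scenario from the question returns the expected documents. Because the term query skips analysis, the search text must already be lowercase to match the lowercased tokens in the index.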

