Elastic Search - Tokenization and Multi Match query


Question


I need to perform tokenization and a multi-match in a single query in Elasticsearch.

Currently: 1) I am using the analyzer to get the tokens, like below:

    String text = // 4 line log data;
    List<AnalyzeToken> analyzeTokenList = new ArrayList<AnalyzeToken>();
    AnalyzeRequestBuilder analyzeRequestBuilder = this.client.admin().indices().prepareAnalyze();
    for (String newIndex : newIndexes) {
        analyzeRequestBuilder.setIndex(newIndex);
        analyzeRequestBuilder.setText(text);
        analyzeRequestBuilder.setAnalyzer(analyzer);
        AnalyzeResponse analyzeResponse = analyzeRequestBuilder.get();
        analyzeTokenList.addAll(analyzeResponse.getTokens());
    }

Then I iterate through the AnalyzeToken list and collect the tokens:

    List<String> tokens = new ArrayList<String>();
    for (AnalyzeToken token : analyzeTokenList) {
        tokens.add(token.getTerm().replaceAll("\\s+", " "));
    }

Then I use the tokens to frame the multi-match query, like below:

String query = "";
for(string data : tokens) {
   query = query + data;
}

     MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(query, "abstract", "title");
    Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);

Based on the result, I am checking whether similar data exists in the database.

Is it possible to combine the analyze step and the multi-match query into a single query? Any help is appreciated!

EDIT: Problem statement: Say I have 90 entries in one index, in which every 10 entries are near-identical (not exactly, but with about a 70% match), so I have 9 groups. I need to process only one entry in each group, so I went with the following approach (which is not a good way, but it is what I have ended up with for now):

Approach:

  1. Get each entry from the 90 entries in the index
  2. Tokenize using the analyzer (this removes the unwanted keywords)
  3. Search in the same index (this checks whether the same kind of data is already in the index) and also filter on the processed flag. This flag is updated after the first log in a group gets processed.
  4. If there is no similar data (70% match) flagged as processed, I process the current log and update its flag to processed.
  5. If similar data already exists with the flag set to processed, I consider the current entry already handled and continue with the next one.

So the ideal goal is to process only one entry out of each group of 10 similar entries.
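
A minimal sketch of steps 3 and 4, assuming a boolean processed flag field and reusing the abstract/title fields and repository from the code above (the field names and the processAndMarkAsProcessed helper are only illustrative, not part of the original code):

    // Look for an already-processed entry that is roughly a 70% match for this log text.
    BoolQueryBuilder duplicateCheck = QueryBuilders.boolQuery()
            .must(QueryBuilders.multiMatchQuery(text, "abstract", "title")
                    .minimumShouldMatch("70%"))                   // "similar", not exact
            .filter(QueryBuilders.termQuery("processed", true));  // only entries already handled

    Iterable<Document> processedTwins = documentRepository.search(duplicateCheck);
    if (!processedTwins.iterator().hasNext()) {
        // No processed twin found: handle this entry, then set its processed flag.
        processAndMarkAsProcessed(currentEntry);                  // hypothetical helper
    }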

Thanks,
Harry


Answer 1:


Multi-match queries internally use match queries, which are analyzed, meaning they apply the analyzer defined in the field's mapping (or the standard analyzer if none is defined).

From the multi-match query docs:

The multi_match query builds on the match query to allow multi-field queries:

Also, accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, as explained in match query.

So what you are trying to do is overkill. Even if you want to change the analyzer (i.e. you need different tokens at search time), you can use a search analyzer, or set the analyzer on the query itself, instead of creating the tokens yourself and then using them in the multi-match query.
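
For example, a minimal sketch of what this suggests, reusing the asker's field names and repository; the custom_search_analyzer name is only an illustration, and setting it is optional if the mapping already defines the analyzer you need:

    // Pass the raw log text straight to the multi-match query; Elasticsearch
    // analyzes it at search time, so no separate analyze call is needed.
    MultiMatchQueryBuilder multiMatchQueryBuilder =
            new MultiMatchQueryBuilder(text, "abstract", "title")
                    .analyzer("custom_search_analyzer")  // optional: override the search-time analyzer
                    .minimumShouldMatch("70%");          // tolerate partial matches

    Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);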



Source: https://stackoverflow.com/questions/63007991/elastic-search-tokenization-and-multi-match-query
