ELK: How do I retrieve more than 10,000 results/events in Elasticsearch

Submitted by 眉间皱痕 on 2019-11-27 13:36:35

Question


Problem: retrieving more than 10,000 results in Elasticsearch via a GET /_search query.

GET hostname:port/myIndex/_search
{
    "size": 10000,
    "query": {
        "term": { "field": "myField" }
    }
}

I have been using the size option knowing that:

index.max_result_window = 100000

But if my query matches 650,000 documents, for example, or even more, how can I retrieve all of the results in one GET?

I have been reading about scroll, from/to, and the pagination API, but none of them ever delivers more than 10K.

This is the example from the Elasticsearch forum that I have been using:

GET /_search?scroll=1m

Can anybody provide an example where you can retrieve all the documents for a GET search query?

Thank you very much.


Answer 1:


Scroll is the way to go if you want to retrieve a high number of documents, high in the sense that it's way over the 10,000 default limit, which can be raised.
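
For reference, that default comes from the index.max_result_window setting. A minimal sketch of raising it with the Python client (the host URL and index name are assumptions; keep in mind that large windows increase heap usage, which is why scroll exists):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Raise the from/size result window for one index.
es.indices.put_settings(
    index='myIndex',
    body={'index': {'max_result_window': 100000}},
)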

The first request needs to specify the query you want to make and the scroll parameter with the duration before the search context times out (1 minute in the example below):

POST /index/type/_search?scroll=1m
{
    "size": 1000,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

In the response to that first call, you get a _scroll_id that you need to use to make the second call:

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

In each subsequent response, you'll get a new _scroll_id that you need to use for the next call, until you've retrieved the number of documents you need.

So in pseudo code it looks somewhat like this:

# first request
response = request('POST /index/type/_search?scroll=1m')
docs = [ response.hits ]
scroll_id = response._scroll_id

# subsequent requests, until a page comes back empty
while (true) {
   response = request('POST /_search/scroll', scroll_id)
   if (response.hits.is_empty) break
   docs.push(response.hits)
   scroll_id = response._scroll_id
}
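
A runnable version of that loop, as a minimal sketch using the official Python client (the host URL, index name, and query are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# First request: open a scroll context that stays alive for 1 minute.
response = es.search(
    index='my-index',
    scroll='1m',
    size=1000,
    body={'query': {'match': {'title': 'elasticsearch'}}},
)
docs = response['hits']['hits']
scroll_id = response['_scroll_id']

# Subsequent requests: keep pulling pages until one comes back empty.
while True:
    response = es.scroll(scroll_id=scroll_id, scroll='1m')
    hits = response['hits']['hits']
    if not hits:
        break
    docs.extend(hits)
    scroll_id = response['_scroll_id']

# Free the server-side search context when done.
es.clear_scroll(scroll_id=scroll_id)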



Answer 2:


Node.js scroll example using the elasticsearch client:

const elasticsearch = require('elasticsearch');
const elasticSearchClient = new elasticsearch.Client({ host: 'esURL' });

async function getAllData(query) {
  // First request: opens the scroll context and returns the first page.
  const result = await elasticSearchClient.search({
    index: '*',
    scroll: '10m',
    size: 10000,
    body: query,
  });

  // Recursively pull pages until all hits have been accumulated.
  const retriever = async ({ data, total, scrollId }) => {
    if (data.length >= total) {
      return data;
    }

    const result = await elasticSearchClient.scroll({
      scroll: '10m',
      scroll_id: scrollId,
    });

    data = [...data, ...result.hits.hits];

    return retriever({
      total,
      scrollId: result._scroll_id,
      data,
    });
  };

  return retriever({
    total: result.hits.total,
    scrollId: result._scroll_id,
    data: result.hits.hits,
  });
}
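
A note on the design: the recursion stops once data.length reaches hits.total from the first response, which assumes hits.total is a plain number (true for the legacy elasticsearch package; newer clients return an object with a value field). Calling clearScroll with the final scroll ID afterwards would also free the server-side search context.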



Answer 3:


Another option is the search_after parameter. Combined with a sort, you can save the sort values of the last element of one page and then ask for results coming after that last element.

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}

Worked for me. But until now, getting more than 10,000 documents has really not been easy.
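
A minimal sketch of that loop in Python, reusing the index, sort, and query from the example above (the client setup and page size are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

docs = []
search_after = None  # no cursor yet on the first page

while True:
    body = {
        'size': 1000,
        'query': {'match': {'title': 'elasticsearch'}},
        # The sort must be deterministic; the _id tie-breaker matters.
        'sort': [{'date': 'asc'}, {'_id': 'desc'}],
    }
    if search_after is not None:
        body['search_after'] = search_after

    response = es.search(index='twitter', body=body)
    hits = response['hits']['hits']
    if not hits:
        break

    docs.extend(hits)
    # The cursor for the next page is the sort values of the last hit.
    search_after = hits[-1]['sort']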




Answer 4:


Here you go:

GET /_search
{
    "size": 10000,
    "query": {
        "match_all": { "boost": "1.0" }
    }
}

But we should mostly avoid this approach of retrieving a huge number of docs at once, as it increases memory usage and overhead on the cluster.




Answer 5:


Look at the search_after documentation.

Example query as a hash in Ruby:

query = {
  size: query_size,
  query: {
    multi_match: {
      query: "black",
      fields: [ "description", "title", "information", "params" ]
    }
  },
  search_after: [after],
  sort: [ {id: "asc"} ]
}
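
Here after is expected to hold the sort value of the last hit from the previous page, and query_size is the page size; both are variables the surrounding code has to supply.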




Answer 6:


I can suggest a better way to do this. I guess you're trying to get more than 10,000 records. Try the way below (it uses the Python elasticsearch-dsl library) and you will get millions of records as well; a consolidated sketch follows after these steps.

  1. Define your client.

    client = Elasticsearch(['http://localhost:9200'])

  2. Create a search object.

    search = Search(using=client)

  3. Check the total number of hits.

    results = search.execute()
    results.hits.total

  4. Create a fresh search object for the real query.

    s = Search(using=client)

  5. Write down your query.

    s = s.query(..write your query here...)

  6. Dump the data into a data frame with scan. Scan will dump all the data into your data frame even if it's in billions, so be careful.

    results_df = pd.DataFrame((d.to_dict() for d in s.scan()))

  7. Have a look at your data frame.

    results_df

  8. If you're getting an error with the Search function, import it first:

    from elasticsearch_dsl import Search



Answer 7:


When there are more than 10,000 results, the only way to get the rest is to split your query into multiple, more refined queries with stricter filters, such that each query returns fewer than 10,000 results. Then combine the query results to obtain your complete target result set.

This limitation to 10,000 results applies to web services that are backed by an Elasticsearch index, and there's just no way around it; the web service would have to be reimplemented without using Elasticsearch.
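
As an illustration of that splitting idea, here is a minimal sketch that partitions one query into monthly date-range slices, each assumed to match fewer than 10,000 documents (the index, field names, and client setup are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# One slice per month; each refined query must stay under the cap.
months = [('2019-01-01', '2019-02-01'),
          ('2019-02-01', '2019-03-01'),
          ('2019-03-01', '2019-04-01')]

all_hits = []
for start, end in months:
    response = es.search(
        index='my-index',
        body={
            'size': 10000,
            'query': {
                'bool': {
                    'must': [{'match': {'title': 'elasticsearch'}}],
                    'filter': [{'range': {'date': {'gte': start, 'lt': end}}}],
                }
            },
        },
    )
    all_hits.extend(response['hits']['hits'])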



Source: https://stackoverflow.com/questions/41655913/elk-how-do-i-retrieve-more-than-10000-results-events-in-elastic-search
