问题
Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?
Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:
{
"size": 0,
"aggs": {
"outer_docs": {
"terms": {"size": 20, "field": "field_1_to_aggregate_on"},
"aggs": {
"inner_docs": {
"terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
"aggs": "things to display here"
}
}
}
}
}
If I execute this query, for some outer_docs, I receive not all inner_docs that are associated with it. In the output below, there are three inner docs for outer doc key_1.
{
"hits": {
"total": 9853,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 3,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "3", "doc_count": 1, "some": "data here"},
]
}
},
...
]
}
}
}
Now, I add a query to singly select one outer_doc that would have been in the first 20 anyway.
"query": {"bool": {"must": [{'term': {'field_1_to_aggregate_on': 'key_1'}}]}}
In this case, I do get all inner_docs, which are in the output below seven inner docs for outer doc key_1.
{
"hits": {
"total": 8,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"outer_docs": {
"doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
"buckets": [
{
"key": "key_1", "doc_count": 8,
"inner_docs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{"key": "1", "doc_count": 1, "some": "data here"},
...
{"key": "7", "doc_count": 2, "some": "data here"},
]
}
},
...
]
}
}
}
I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?
This is my version information:
{
'build_date': '2018-09-26T13:34:09.098244Z',
'build_flavor': 'default',
'build_hash': '04711c2',
'build_snapshot': False,
'build_type': 'deb',
'lucene_version': '7.4.0',
'minimum_index_compatibility_version': '5.0.0',
'minimum_wire_compatibility_version': '5.6.0',
'number': '6.4.2'
}
EDIT: After digging a bit more, I found out that the issue was unrelated to subaggregation, but to aggregation itself and the usage of shards. I have opened this bug report for Elastic about it:
- https://discuss.elastic.co/t/bug-in-aggregation-result-when-using-shards/164161
- https://github.com/elastic/elasticsearch/issues/37425
回答1:
How about using the composite aggregation for this? Pretty sure that solves your problem.
GET /_search
{
"aggs" : {
"all_docs": {
"composite" : {
"size": 1000,
"sources" : [
{ "outer_docs": { "terms": { "field": "field_1_to_aggregate_on" } } },
{ "inner_docs": { "terms": { "field": "field_2_to_aggregate_on" } } }
]
}
}
}
}
If you have many buckets, the composite aggregation will help you scroll through each of them using size/after.
回答2:
Check your elastic deprecation logfile. You will probably have some warnings like this:
This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
search.max_buckets is a dynamic cluster setting that defaults to 10.000 buckets in 7.0.
Now, this is not documented anywhere, but in my experience: Allocating over 10.000 buckets result in the termination of your query, but you will get back the results that have been achieved until that moment. This explains missing data in your result
Using the composite Aggregation will help, your other option is to increase the max_buckets. Be careful with that, you can crash your entire cluster that way, because there is a cost for every bucket (RAM). It does not matter if you actually use all the allocated buckets, you can crash with empty buckets only.
See:
https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking-changes-7.0.html#_literal_search_max_buckets_literal_in_the_cluster_setting https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html https://github.com/elastic/elasticsearch/issues/35896
回答3:
It turned out that the problem was not due to subaggregation, and that it is an actual feature of ElasticSearch. We are using 5 shards, and when using shards, aggregations only return approximate results.
We have made this problem reproducible, and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, with a link to the documentation where this is explained in more detail.
We also learned that using only 1 shard solves the issue, and when that is not possible, the parameter shard_size
can alleviate the problem.
来源:https://stackoverflow.com/questions/54091802/subaggregation-leads-to-missing-data