问题
I have the following json
[
{"firstname": "john", "lastname": "doe"},
{"firstname": "john", "lastname": "smith"},
{"firstname": "jane", "lastname": "smith"},
{"firstname": "jane", "lastname": "doe"},
{"firstname": "joe", "lastname": "smith"},
{"firstname": "joe", "lastname": "doe"},
{"firstname": "steve", "lastname": "smith"},
{"firstname": "jack", "lastname": "doe"}
]
I want to get a count of duplicate firstnames
duplicates count 3
Count of non-duplicate firstnames
non-duplicates count 2
I tried to count the number of buckets, but it seems to count all buckets whether it's duplicate or non-duplicate
GET mynames/_search
{
"aggs" : {
"name_count" : {
"terms" : {
"field" : "firstname.keyword",
"min_doc_count": 2
}
},
"count":{
"cardinality": {
"field": "firstname.keyword"
}
}
}
回答1:
Well I've made use of several aggregations here. The below are the lists which I've used. The order of the list is the execution order of the aggregation.
For Duplicates
- Terms Aggregation
- Stats Bucket Aggregation
For Non Duplicates
- Terms Aggregation
- Bucket Selector (As a sub aggregation)
- Sum Bucket Selector
Aggregation Query:
POST <your_index_name>/_search
{
"size":0,
"aggs":{
"duplicate_aggs":{
"terms":{
"field":"firstname.keyword",
"min_doc_count":2
}
},
"duplicate_bucketcount":{
"stats_bucket":{
"buckets_path":"duplicate_aggs._count"
}
},
"nonduplicate_aggs":{
"terms":{
"field":"firstname.keyword"
},
"aggs":{
"equal_one":{
"bucket_selector":{
"buckets_path":{
"count":"_count"
},
"script":"params.count == 1"
}
}
}
},
"nonduplicate_bucketcount":{
"sum_bucket":{
"buckets_path":"nonduplicate_aggs._count"
}
}
}
}
Response
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"duplicate_aggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "jane",
"doc_count": 2
},
{
"key": "joe",
"doc_count": 2
},
{
"key": "john",
"doc_count": 2
}
]
},
"nonduplicate_aggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "jack",
"doc_count": 1
},
{
"key": "steve",
"doc_count": 1
}
]
},
"duplicate_bucketcount": {
"count": 3,
"min": 2,
"max": 2,
"avg": 2,
"sum": 6
},
"nonduplicate_bucketcount": {
"value": 2
}
}
}
Notice that in the above response, we have duplicate_bucketcount.count
key whose value 3
is what would display the bucket count which is the number of keys which are duplicates.
Let me know if it helps!
来源:https://stackoverflow.com/questions/53359102/elasticsearch-count-duplicated-and-unique-values