Question
I'd like to fetch a list of unique numeric user IDs in a given period.
Let's say the field is userId and the time field is startTime. I successfully get results as below:
HashSet<int> hashUserIdList = new HashSet<int>(); // guarantees stored userIds are unique.

// Step 1. Get the number of unique userIds with a Cardinality aggregation.
var total = client.Search<Log>(s => s
    .Query(q => q
        .DateRange(c => c
            .Field(p => p.startTime)
            .GreaterThan(FixedDate)))
    .Aggregations(a => a
        .Cardinality("userId_cardinality", c => c
            .Field("userId"))))
    .Aggs.Cardinality("userId_cardinality");

int totalCount = (int)total.Value;

// Step 2. Get the unique userId values with a Terms aggregation sized to that count.
var response = client.Search<Log>(s => s
    .Source(source => source.Includes(inc => inc.Field("userId")))
    .Query(q => q
        .DateRange(c => c
            .Field(p => p.startTime)
            .GreaterThan(FixedDate)))
    .Aggregations(a => a
        .Terms("userId_terms", c => c
            .Field("userId")
            .Size(totalCount))))
    .Aggs.Terms("userId_terms");

// Step 3. Store the unique userIds in the HashSet.
foreach (var element in response.Buckets)
{
    hashUserIdList.Add(int.Parse(element.Key));
}
It works, but it seems inefficient because (1) it fetches totalCount with a separate request first, and (2) it passes Size(totalCount) to the Terms aggregation, which could cause a 500 server error from bucket overflow if the result has many thousands of terms.
It would be nicer to iterate over the results foreach-style, say 100 at a time, but I failed to make them iterable in pages; I put From/Size or Skip/Take here and there, but the returned values were unreliable.
How can I code this correctly?
Answer 1:
This approach may be OK for some sets but a couple of observations:
- The Cardinality Aggregation uses the HyperLogLog++ algorithm to approximate cardinality; this approximation can be completely accurate for low cardinality fields but less so for high cardinality ones (see the sketch after this list for controlling the precision).
- The Terms Aggregation can be computationally expensive for many terms, as each bucket needs to be built in memory and then serialized to the response.
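The cardinality aggregation accepts a precision threshold: counts up to that threshold are close to exact, at the cost of extra memory, and the maximum accepted value is 40000. A minimal sketch reusing the question's Log type and FixedDate, assuming a NEST version that exposes PrecisionThreshold:

// A minimal sketch reusing the question's client, Log type and FixedDate.
// precision_threshold raises the count up to which the HyperLogLog++
// estimate is near-exact, trading memory for accuracy (capped at 40000).
var cardinalityResponse = client.Search<Log>(s => s
    .Size(0) // no hits needed, only the aggregation
    .Query(q => q
        .DateRange(c => c
            .Field(p => p.startTime)
            .GreaterThan(FixedDate)))
    .Aggregations(a => a
        .Cardinality("userId_cardinality", c => c
            .Field("userId")
            .PrecisionThreshold(40000))));

var approximateCount = cardinalityResponse.Aggs.Cardinality("userId_cardinality").Value;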
You can probably skip the Cardinality Aggregation to get the size, and simply pass int.MaxValue as the size of the Terms Aggregation. An alternative approach, less efficient in terms of speed, is to scroll through all documents in the range, with a source filter to return only the field you're interested in. I would expect the Scroll approach to put less pressure on the cluster, but I'd recommend monitoring whichever approach you take.
Here's a comparison of the two approaches on the Stack Overflow data set (taken June 2016, IIRC), looking at unique question askers between two years ago today and one year ago today.
Terms Aggregation
void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var connectionSettings = new ConnectionSettings(pool)
        .MapDefaultTypeIndices(d => d
            .Add(typeof(Question), NDC.StackOverflowIndex)
        );

    var client = new ElasticClient(connectionSettings);

    var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
    var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

    var searchResponse = client.Search<Question>(s => s
        .Size(0)
        .Query(q => q
            .DateRange(c => c
                .Field(p => p.CreationDate)
                .GreaterThan(twoYearsAgo)
                .LessThan(yearAgo)
            )
        )
        .Aggregations(a => a
            .Terms("unique_users", c => c
                .Field(f => f.OwnerUserId)
                .Size(int.MaxValue)
            )
        )
    );

    var uniqueOwnerUserIds = searchResponse.Aggs.Terms("unique_users").Buckets.Select(b => b.KeyAsString).ToList();

    // 3.83 seconds
    // unique question askers: 795352
    Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
}
Scroll API
void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var connectionSettings = new ConnectionSettings(pool)
        .MapDefaultTypeIndices(d => d
            .Add(typeof(Question), NDC.StackOverflowIndex)
        );

    var client = new ElasticClient(connectionSettings);
    var uniqueOwnerUserIds = new HashSet<int>();

    var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
    var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

    var searchResponse = client.Search<Question>(s => s
        .Source(sf => sf
            .Include(ff => ff
                .Field(f => f.OwnerUserId)
            )
        )
        .Size(10000)
        .Scroll("1m")
        .Query(q => q
            .DateRange(c => c
                .Field(p => p.CreationDate)
                .GreaterThan(twoYearsAgo)
                .LessThan(yearAgo)
            )
        )
    );

    while (searchResponse.Documents.Any())
    {
        foreach (var document in searchResponse.Documents)
        {
            if (document.OwnerUserId.HasValue)
                uniqueOwnerUserIds.Add(document.OwnerUserId.Value);
        }

        searchResponse = client.Scroll<Question>("1m", searchResponse.ScrollId);
    }

    client.ClearScroll(c => c.ScrollId(searchResponse.ScrollId));

    // 91.8 seconds
    // unique question askers: 795352
    Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
}
Terms aggregation is ~24 times faster than the Scroll API approach.
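Neither example pages through the terms in fixed-size chunks, which was the other part of the question, and on newer Elasticsearch versions a huge terms size can be rejected outright (the search.max_buckets limit defaults to 10000 in 7.x). One hedged option, assuming Elasticsearch 5.2+ and a NEST version that exposes terms partitioning via Include(partition, numberOfPartitions), is to let Elasticsearch hash the terms into partitions and fetch one partition per request; the partition count and per-partition size below are illustrative, not tuned:

// A hedged sketch, not part of the benchmarks above. Reuses client,
// twoYearsAgo and yearAgo from the examples. Each request returns only
// the terms that hash into one partition, so no single response has to
// hold every bucket.
var numberOfPartitions = 20;
var allOwnerUserIds = new HashSet<int>();

for (var partition = 0; partition < numberOfPartitions; partition++)
{
    var partitionResponse = client.Search<Question>(s => s
        .Size(0)
        .Query(q => q
            .DateRange(c => c
                .Field(p => p.CreationDate)
                .GreaterThan(twoYearsAgo)
                .LessThan(yearAgo)
            )
        )
        .Aggregations(a => a
            .Terms("unique_users", c => c
                .Field(f => f.OwnerUserId)
                .Include(partition, numberOfPartitions)
                .Size(100000) // must exceed the largest partition's term count
            )
        )
    );

    foreach (var bucket in partitionResponse.Aggs.Terms("unique_users").Buckets)
        allOwnerUserIds.Add(int.Parse(bucket.KeyAsString));
}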
Source: https://stackoverflow.com/questions/41739226/elasticsearch-nest-better-code-for-terms-aggregation-and-its-iteration