Question
From a MongoDB aggregation returning a single record for each hour, I also need to know the 'mode' or most frequently occurring value in a field.
So far I have selected the set of records between two dates, and am returning a single record for each hour, including an average of a field value. But I also need the most frequent category, where the numeric category field contains 1, 2, 3 or 4.
var myName = "CollectionName"
//schema for mongoose
var mySchema = new Schema({
dt: Date,
value: Number,
category: Number
});
var myDB = mongoose.createConnection('mongodb://localhost:27017/MYDB');
var myDBObj = myDB.model(myName, mySchema, myName);
The date math in the following $group creates a record for each hour in the day, and the $avg averages the price field.
But I can't figure out how to return the most frequent occurrence of 1, 2, 3 or 4 in the category field. There is no $mode aggregation operator, and I get the error "exception: unknown group operator '$mode'".
myDBObj.aggregate([
{
$match: { "dt": { $gt: new Date("October 13, 2010 12:00:00"), $lt: new Date("November 13, 2010 12:00:00") } }
},{
$group: {
"_id": {
"dt": {
"$add": [
{
"$subtract": [
{ "$subtract": ["$dt", new Date(0)] },
{
"$mod": [
{ "$subtract": ["$dt", new Date(0)] },
3600000//1000 * 60 * 60
]
}
]
},
new Date(0)
]
}
},
"price": { "$avg": "$price" },
"category" : { "$mode" : "$category"}
}
}], function (err, data) { if (err) { return next(err); } res.json(data); });
Is there a way to return the most common value contained in a field?
Do I need to use map-reduce functions? How would I combine them with the hourly aggregation above? Thank you for any help.
Answer 1:
Well, you cannot just "make up" operators: $mode is not an aggregation operator, and the only things you can use are those that actually exist.
So in order to return the category value that occurs the most within the grouped time period, it is necessary to group first on each of those values and return the count of occurrences. Then you can order these results by that count, and return the category value that recorded the highest count within that period:
// Filter dates
{ "$match": {
"dt": {
"$gt": new Date("October 13, 2010 12:00:00"),
"$lt": new Date("November 13, 2010 12:00:00")
}
}},
// Group by hour and category, with avg and count
{ "$group": {
"_id": {
"dt": {
"$add": [
{
"$subtract": [
{ "$subtract": ["$dt", new Date(0)] },
{
"$mod": [
{ "$subtract": ["$dt", new Date(0)] },
1000 * 60 * 60 // one hour in milliseconds
]
}
]
},
new Date(0)
]
},
"category": "$category"
},
"price": { "$avg": "$price" },
"count": { "$sum": 1 }
}},
// Sort on date and count
{ "$sort": { "_id.dt": 1, "count": -1 }},
// Group on just the date, keeping the avg and the first category
{ "$group": {
"_id": "$_id.dt",
"price": { "$avg": "$price" },
"category": { "$first": "$_id.category" }
}}
So $group on both date and category and retain the category count via $sum. Then you $sort so the largest "count" is on top for each grouped date. And finally use $first when you apply another $group that is just applied to the date itself, in order to return that category with the largest count for each date.
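The same three steps can be sketched in plain JavaScript on an in-memory array, which may help when checking what each stage contributes. This is only an illustration of the logic, not MongoDB code; the sample documents and the helper name modePerHour are made up for the example.

```javascript
// Illustration only: plain JavaScript mimicking the pipeline stages.
// The sample documents and the helper name modePerHour are hypothetical.
const HOUR_MS = 1000 * 60 * 60;

function modePerHour(docs) {
  // Stage 1: group by (hour, category), keeping a count and a price sum.
  const groups = new Map();
  for (const doc of docs) {
    const t = doc.dt.getTime();
    const hour = t - (t % HOUR_MS); // same truncation as the $subtract/$mod math
    const key = hour + "|" + doc.category;
    const g = groups.get(key) ||
      { hour: hour, category: doc.category, count: 0, priceSum: 0 };
    g.count += 1;
    g.priceSum += doc.price;
    groups.set(key, g);
  }

  // Stage 2: sort by hour ascending, then by count descending.
  const sorted = [...groups.values()].sort(
    (a, b) => a.hour - b.hour || b.count - a.count
  );

  // Stage 3: group on the hour alone. The first row seen for each hour
  // carries the largest count, so its category is the mode for that hour.
  const byHour = new Map();
  for (const g of sorted) {
    const h = byHour.get(g.hour) ||
      { dt: new Date(g.hour), category: g.category, count: 0, priceSum: 0 };
    h.count += g.count;
    h.priceSum += g.priceSum;
    byHour.set(g.hour, h);
  }

  return [...byHour.values()].map(h => ({
    dt: h.dt,
    category: h.category,
    price: h.priceSum / h.count
  }));
}

const sampleDocs = [
  { dt: new Date("2010-10-13T12:05:00Z"), price: 10, category: 2 },
  { dt: new Date("2010-10-13T12:20:00Z"), price: 20, category: 2 },
  { dt: new Date("2010-10-13T12:40:00Z"), price: 60, category: 3 },
  { dt: new Date("2010-10-13T13:10:00Z"), price: 5, category: 4 }
];

const hourly = modePerHour(sampleDocs);
console.log(hourly);
```

Note that this sketch computes the hourly price from a running sum and count. Averaging the per-category averages gives each category equal weight, so the two results can differ when the categories within an hour have different numbers of documents.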
Don't be tempted by operators like $max, as they do not work here. The key difference is the "tied" relation to the "record/document" produced for each category value. So it is not the maximum "count" you want or the maximum "category" value, but instead the category value that "produced" the largest count. Hence there is a $sort needed here.
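A small plain JavaScript sketch of that difference, assuming some made-up per-category rows for a single hour (the values are hypothetical and only mimic what the first $group stage would emit):

```javascript
// Illustration only: hypothetical per-category rows for one hour,
// shaped like the output of the first $group stage.
const rows = [
  { category: 4, count: 1 },
  { category: 2, count: 7 },
  { category: 3, count: 5 }
];

// What a maximum over each field reports, independently of the other field.
const maxCount = Math.max(...rows.map(r => r.count));
const maxCategory = Math.max(...rows.map(r => r.category));

// Sorting by count descending and taking the first row keeps the category
// attached to the count that produced it.
const top = [...rows].sort((a, b) => b.count - a.count)[0];

console.log(maxCount, maxCategory, top.category); // 7 4 2
```

The maximum category here is 4, which occurs only once; the most frequent category is 2, and only the sorted row returns it.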
Finally, some habits you "should" break:
Don't use non-UTC format date instance data as input unless you really know what you are doing. Dates are always going to be converted to UTC, so at least in test listings you should get used to specifying the date value that way.
It might look a bit cleaner the other way, but things like 1000 * 60 * 60 are a lot more descriptive of what the code is doing than 3600000. Same value, but one form is indicative of its time units at a glance.
Compounding _id when there is only a single value can also confuse issues. So there is little point in accessing _id.dt if that was the only value present. When there is more than a single property within _id then it is fine. But single values should just be assigned right back to _id alone. Nothing is gained otherwise, and a single value is quite clear.
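As a rough illustration of the first two habits in plain JavaScript: an ISO 8601 string ending in "Z" is parsed as UTC, whereas a string such as "October 13, 2010 12:00:00" is typically interpreted in the local time zone of the machine running the code. The constant name ONE_HOUR_MS and the sample timestamp are made up for the example.

```javascript
// Illustration only: the constant name and the sample timestamp are hypothetical.
const ONE_HOUR_MS = 1000 * 60 * 60; // milliseconds * seconds * minutes

// An ISO 8601 string with a trailing "Z" is parsed as UTC on every machine.
const sampleTime = new Date("2010-10-13T12:34:56Z");

// Truncate to the start of the hour, the same arithmetic that the
// $subtract / $mod expression performs inside the pipeline.
const hourStart = new Date(
  sampleTime.getTime() - (sampleTime.getTime() % ONE_HOUR_MS)
);

console.log(hourStart.toISOString()); // 2010-10-13T12:00:00.000Z
```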
Source: https://stackoverflow.com/questions/33708233/need-to-find-the-most-frequently-occurring-value-of-a-field-in-a-aggregate