multi sum/count on mongodb (sum gender and total all results)

Submitted by ≡放荡痞女 on 2019-12-23 02:33:02

Question


I have these documents:

{gender:"male", ...},
{gender:"male", ...},
{gender:"female", ...},
{gender:"female", ...},
{gender:"female", ...},

So, I need to retrieve something like:

{
total:5,
male:2,
female:3
}

My current query (which doesn't work):

db.collection.aggregate([
{
    $match:{...}
},
{
    $group:{
        _id:"$_id",
        gender:{
            $push:"$gender"
        },
        total:{
            $sum:1
        }
    }
},
{
    $unwind:"$gender"
},
{
    $group:{
        _id:"$gender",
        name:{
            $addToSet:"$all"
        },
        "count":{
            $sum:1
        }
    }
}
])

How can I retrieve the count for each gender plus the total? Thanks.


Answer 1:


Something like this will do the trick:

db.collection.aggregate([
  {$project: {
    male: {$cond: [{$eq: ["$gender", "male"]}, 1, 0]},
    female: {$cond: [{$eq: ["$gender", "female"]}, 1, 0]},
  }},
  {$group: { _id: null, male: {$sum: "$male"},
                        female: {$sum: "$female"},
                        total: {$sum: 1},
  }},
])

Producing given your example:

{ "_id" : null, "male" : 2, "female" : 3, "total" : 5 }

The key idea is to use a conditional expression to map the gender to 0 or 1. After that, all you need is a simple sum over each field.




Answer 2:


It is possible to get the result in a number of ways, but it helps to understand how the aggregation pipeline arrives at it.

The general case here is to test the value of "gender" and then decide whether or not to accumulate a total for that gender. The fields can be separated by logic using an $eq test inside the $cond operator. But the most efficient way is to process directly in $group:

var start = Date.now();
db.people.aggregate([
    { "$group": {
        "_id": null,
        "male": { 
            "$sum": { 
                "$cond": [
                    { "$eq": ["male","$gender"] },
                   1,
                   0
                ]
            }
        },
        "female": { 
            "$sum": { 
                "$cond": [
                    { "$eq": ["female","$gender"] },
                   1,
                   0
                ]
            }
        },
        "total": { "$sum": 1 }
    }}
])
var end = Date.now();
end - start;

Now on my little laptop, with a reasonably even sample of random "gender", that pipeline consistently takes around 290ms to run, as every document is evaluated for which fields to total and summed at the same time.

On the other hand, if you write in a $project stage as has been suggested elsewhere:

var start = Date.now();
db.people.aggregate([
    { "$project": {
        "male": { 
            "$cond": [
                { "$eq": ["male","$gender"] },
               1,
               0
            ]
        },
        "female": { 
            "$cond": [
                { "$eq": ["female","$gender"] },
               1,
               0
            ]
        },
    }},
    { "$group": {
        "_id": null,
        "male": { "$sum": "$male" },
        "female": { "$sum": "$female" },
        "total": { "$sum": 1 }
    }}
])
var end = Date.now();
end - start;

Then the average result comes in at 460ms to run the pipeline, which is getting close to "double" the time. So what is going on here?

Basically, $project needs to process every document in the collection "before" it is sent to the $group stage, and that is exactly what it does: the pipeline alters the structure of each document (100,000 in this test) before anything else is done with it.

This is where it helps to be able to look at the problem "logically" and say "Hang on a moment, why am I doing that there when I could do it here?", then come to the realization that all of the logic compacts into a single stage.

This is what design and optimization are all about. So if you are going to learn, then it helps to learn the right way.


Sample generation:

var bulk = db.people.initializeOrderedBulkOp(),
    count = 0,
    gender = ["male","female"];

for ( var x=1; x<=100000; x++ ) {
    bulk.insert({
        "gender": gender[Math.floor(Math.random()*2)]
    });
    count++;

    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.people.initializeOrderedBulkOp();
    }
}

Results of both pipelines:

{ "_id" : null, "male" : 50086, "female" : 49914, "total" : 100000 }

Timings

The "in client" timings provided in the main body of course include the actual overhead of client interpretation and execution, as well as transfer, though this is on a local server.

I did a re-run and analysed the logs on a fresh MongoDB 3.0.3 Ubuntu 15.04 VM ( 2GB allocation with 4 cores assigned ) on a pretty old Intel Core i7 laptop host with 8GB and Windows 7 64-bit that I never bothered to overwrite.

The actual timings on server only from logs on average for 1000 executions each ( warmed up ):

single $group optimal: avg: 185ms min: 98ms max: 205ms

separate $project: avg: 330ms min: 316ms max: 410ms

So the server-side figures are actually a "little" nearer to the "worse" case of almost double the time, by a much closer margin. But that is exactly what I would expect from the results. Nearly 50% of the overall "cost" here is loading and processing the data into the pipeline in memory, so the difference comes from being able to reduce the result at the same time as loading and processing.



Source: https://stackoverflow.com/questions/31484316/multi-sum-count-on-mongodb-sum-gender-and-total-all-results
