Limit aggregation in grouped aggregation

Question

I have a collection like this, but with much more data.

{
  _id: ObjectId("db759d014f70743495ef1000"),
  tracked_item_origin: "winword",
  tracked_item_type: "Software",
  machine_user: "mmm.mmm",
  organization_id: ObjectId("a91864df4f7074b33b020000"),
  group_id: ObjectId("20ea74df4f7074b33b520000"),
  tracked_item_id: ObjectId("1a050df94f70748419140000"),
  tracked_item_name: "Word",
  duration: 9540,
}

{
  _id: ObjectId("2b769d014f70743495fa1000"),
  tracked_item_origin: "http://www.facebook.com",
  tracked_item_type: "Site",
  machine_user: "gabriel.mello",
  organization_id: ObjectId("a91864df4f7074b33b020000"),
  group_id: ObjectId("3f6a64df4f7074b33b040000"),
  tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
  tracked_item_name: "Facebook",
  duration: 7920,
}

I run an aggregation that returns grouped data like this:

{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}

Simple aggregation code:

AgentCollector.collection.aggregate(
  {'$match' => {group_id: '20ea74df4f7074b33b520000'}},
  {'$group' => {
    _id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
    duration: {'$sum' => '$duration'}
  }},
  {'$sort' => {
    '_id.tracked_item_type' => 1,
    duration: -1
  }}
)

Is there a way to limit the results to only 2 items per tracked_item_type key? For example, 2 Sites and 2 Software items.

Answer 1:

As it currently stands, your question is unclear. I really hope you mean that you want to specify two Site keys and two Software keys, because that has a nice and simple answer that you can just add to your $match phase, as in:

{$match: {
    group_id: "20ea74df4f7074b33b520000",
    tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},

And we can all cheer and be happy ;)
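In the Ruby hash syntax used in the question, the same stage could be built as below. This is only a sketch: `match_stage` is a hypothetical helper name, not part of any driver, and the group_id is passed as a plain string exactly as in the question's own $match.

```ruby
# Hypothetical helper (not a driver API): builds a $match stage that keeps
# only the named items within one group.
def match_stage(group_id, item_names)
  {
    '$match' => {
      'group_id' => group_id,
      'tracked_item_name' => { '$in' => item_names }
    }
  }
end

stage = match_stage('20ea74df4f7074b33b520000', %w[Twitter Facebook Word Excel])
```

The resulting hash would take the place of the first stage in the question's aggregate call.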

If, however, your question is something more diabolical, such as getting the top 2 Site and top 2 Software entries from the result by duration, then we thank you very much for spawning this abomination.
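To pin down what "top 2 per type by duration" means, here is a minimal plain-Ruby sketch that applies it client-side to the grouped rows shown in the question. `top_per_type` is a hypothetical helper name, and the rows are assumed to arrive as plain hashes with string keys.

```ruby
# Grouped rows shaped like the aggregation output in the question.
rows = [
  { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'Twitter' }, 'duration' => 288540 },
  { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'ANoticia' }, 'duration' => 237300 },
  { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'Facebook' }, 'duration' => 203460 },
  { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Word' }, 'duration' => 269760 },
  { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Excel' }, 'duration' => 204240 }
]

# Hypothetical helper: keep the `limit` longest-duration rows of each type.
def top_per_type(rows, limit)
  rows
    .group_by { |row| row['_id']['tracked_item_type'] }
    .flat_map { |_type, group| group.sort_by { |row| -row['duration'] }.first(limit) }
end

top = top_per_type(rows, 2)
top_names = top.map { |row| row['_id']['tracked_item_name'] }
# top_names => ["Twitter", "ANoticia", "Word", "Excel"]
```

This pulls every grouped row into the application before trimming, which is the cost the server-side pipeline tries to avoid.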

Warning:

Your mileage may vary depending on what you actually want to do, and this may well blow up from the sheer size of your results. But the following is an example of what you are in for:

db.collection.aggregate([

    // Match items first to reduce the set
    {$match: {group_id: "20ea74df4f7074b33b520000" }},

    // Group on the types and "sum" of duration
    {$group: {
        _id: {
            tracked_item_type: "$tracked_item_type",
            tracked_item_name: "$tracked_item_name"
         },
         duration: {$sum: "$duration"}
    }},

    // Sort by type and duration descending
    {$sort: { "_id.tracked_item_type": 1, duration: -1 }},

    /* The fun part */

    // Re-shape results to "sites" and "software" arrays 
    {$group: { 
        _id: null,
        sites: {$push:
            {$cond: [
                {$eq: ["$_id.tracked_item_type", "Site" ]},
                { _id: "$_id", duration: "$duration" },
                null
            ]}
        },
        software: {$push:
            {$cond: [
                {$eq: ["$_id.tracked_item_type", "Software" ]},
                { _id: "$_id", duration: "$duration" },
                null
            ]}
        }
    }},


    // Remove the null values for "software"
    {$unwind: "$software"},
    {$match: { software: {$ne: null} }},
    {$group: { 
        _id: "$_id",
        software: {$push: "$software"}, 
        sites: {$first: "$sites"} 
    }},

    // Remove the null values for "sites"
    {$unwind: "$sites"},
    {$match: { sites: {$ne: null} }},
    {$group: { 
        _id: "$_id",
        software: {$first: "$software"},
        sites: {$push: "$sites"} 
    }},


    // Project out software and limit to the *top* 2 results
    {$unwind: "$software"},
    {$project: { 
        // Carry each software entry in _id; the later $group pushes it back into an array
        _id: { _id: "$software._id", duration: "$software.duration" },
        sites: "$sites"
    }},
    {$limit : 2},


    // Project sites, grouping multiple software per key, requires a sort
    // then limit the *top* 2 results
    {$unwind: "$sites"},
    {$group: {
        _id: { _id: "$sites._id", duration: "$sites.duration" },
        software: {$push: "$_id" }
    }},
    {$sort: { "_id.duration": -1 }},
    {$limit: 2}

])

Now what that results in is not exactly the clean set of results that would be ideal, but it is something that can be programmatically worked with, and it is better than filtering the previous results in a loop. (The data below is from my own testing.)

{
    "result" : [
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Digital Blasphemy"
                 },
                 "duration" : 8000
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },

                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        },
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Facebook"
                 },
                 "duration" : 7920
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                         "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        }
    ],
    "ok" : 1
}

So you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself cannot clean this up any further, because we would need to re-merge the items we split apart in order to do this, and as yet there is no operator that we could use to perform this action.

But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.
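As a rough idea of that last step, the Ruby sketch below flattens the nested output into four documents. It assumes the "result" array has been read into plain hashes with string keys, and `flatten_result` is a hypothetical helper name.

```ruby
# The two software entries repeated under every site in the output above.
software = [
  { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Word' }, 'duration' => 9540 },
  { '_id' => { 'tracked_item_type' => 'Software', 'tracked_item_name' => 'Notepad' }, 'duration' => 4000 }
]

# The "result" array from the output above, as plain Ruby hashes.
result = [
  { '_id' => { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'Digital Blasphemy' },
               'duration' => 8000 },
    'software' => software },
  { '_id' => { '_id' => { 'tracked_item_type' => 'Site', 'tracked_item_name' => 'Facebook' },
               'duration' => 7920 },
    'software' => software.map(&:dup) }
]

# Hypothetical helper: each outer _id is already a site document, and the
# software list is duplicated per site, so de-duplicate it once.
def flatten_result(result)
  sites = result.map { |doc| doc['_id'] }
  software = result.flat_map { |doc| doc['software'] }.uniq
  sites + software
end

flat = flatten_result(result)
flat_names = flat.map { |doc| doc['_id']['tracked_item_name'] }
# flat_names => ["Digital Blasphemy", "Facebook", "Word", "Notepad"]
```

The de-duplication relies on Ruby comparing hashes by content, so it only holds if the driver returns hash-like documents.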

Source: https://stackoverflow.com/questions/21913335/limit-aggregation-in-grouped-aggregation

Tags

ruby

mongodb

mongodb-query

mongoid

aggregation-framework