MongoDB: group values by multiple fields

Anonymous (unverified), submitted 2019-12-03 01:18:02

Question:

For example, I have these documents:

{ addr: 'address1', book: 'book1' },
{ addr: 'address2', book: 'book1' },
{ addr: 'address1', book: 'book5' },
{ addr: 'address3', book: 'book9' },
{ addr: 'address2', book: 'book5' },
{ addr: 'address2', book: 'book1' },
{ addr: 'address1', book: 'book1' },
{ addr: 'address15', book: 'book1' },
{ addr: 'address9', book: 'book99' },
{ addr: 'address90', book: 'book33' },
{ addr: 'address4', book: 'book3' },
{ addr: 'address5', book: 'book1' },
{ addr: 'address77', book: 'book11' },
{ addr: 'address1', book: 'book1' }

and so on.


How can I make a request, which will describe the top N addresses and the top M books per address?

Example of expected result:

address1 | book_1: 5
| book_2: 10
| book_3: 50
| total: 65
______________________
address2 | book_1: 10
| book_2: 10
|...
| book_M: 10
| total: M*10
...
______________________
addressN | book_1: 20
| book_2: 20
|...
| book_M: 20
| total: M*20

Answer 1:

TLDR Summary

In modern MongoDB releases you can brute force this with $slice applied directly to the basic aggregation result. For "large" results, run parallel queries for each grouping instead, or wait for SERVER-9377 to be resolved, which would allow a "limit" on the number of items to $push to an array.

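db.books.aggregate([
    // Count each distinct (addr, book) pair
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    // Sort so that $push collects books in descending count order;
    // without this stage $slice keeps an arbitrary pair, not the top two
    { "$sort": { "bookCount": -1 } },
    // Collect the books per address and total them
    { "$group": {
        "_id": "$_id.addr",
        "books": {
            "$push": {
                "book": "$_id.book",
                "count": "$bookCount"
            }
        },
        "count": { "$sum": "$bookCount" }
    }},
    // Top 2 addresses
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    // Keep only the first 2 books of each address
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])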

MongoDB 3.6 Preview

This release still does not resolve SERVER-9377, but $lookup gains a new "non-correlated" form which takes a "pipeline" expression as an argument instead of the "localField" and "foreignField" options. This allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": {
        "$expr": { "$eq": [ "$addr", "$$addr" ] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

The other addition here is of course the ability to interpolate the variable through $expr using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limit each result separately.
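As a rough illustration of the "pipeline within a pipeline" idea, here is a minimal sketch in plain JavaScript that mimics the same logic over an in-memory array rather than a real collection. The names `docs`, `topCounts` and `topBooksPerAddress` are made up for this example and are not part of any MongoDB API; the sample data is a shortened version of the documents in the question.

```javascript
// Hypothetical in-memory stand-in for the "books" collection.
const docs = [
  { addr: 'address1', book: 'book1' },
  { addr: 'address2', book: 'book1' },
  { addr: 'address1', book: 'book5' },
  { addr: 'address3', book: 'book9' },
  { addr: 'address2', book: 'book5' },
  { addr: 'address2', book: 'book1' },
  { addr: 'address1', book: 'book1' },
  { addr: 'address1', book: 'book1' },
];

// Count items per distinct value of `field`, sort by count descending,
// and keep the first `limit` groups (roughly $group + $sort + $limit).
function topCounts(items, field, limit) {
  const counts = new Map();
  for (const item of items) {
    counts.set(item[field], (counts.get(item[field]) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Outer "pipeline": top N addresses. Inner "pipeline": for each of them,
// filter by that address (the role of $match on $$addr) and take the
// top M books.
function topBooksPerAddress(items, n, m) {
  return topCounts(items, 'addr', n).map((outer) => ({
    ...outer,
    books: topCounts(items.filter((d) => d.addr === outer._id), 'book', m),
  }));
}

const nestedTop = topBooksPerAddress(docs, 2, 2);
console.log(JSON.stringify(nestedTop, null, 2));
```

Because the inner step runs once per outer result, each level gets its own limit, which is the property the $lookup form provides on the server.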

This would be the next best option to running parallel queries, and would actually be better if the $match were allowed and able to use an index in the "sub-pipeline" processing. So while it does not use the "limit to $push" that the referenced issue asks for, it actually delivers something that should work better.
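The "parallel queries" alternative could look something like the sketch below: one pipeline finds the top N addresses, then one small pipeline per address finds its top M books. Only the pipeline construction is shown; `topAddressesPipeline` and `booksPipelineFor` are hypothetical helper names, and the address list is hard-coded here where a real run would use the result of the first query.

```javascript
// Pipeline for the first query: the N addresses with the most documents.
function topAddressesPipeline(n) {
  return [
    { $group: { _id: '$addr', count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: n },
  ];
}

// Pipeline for one follow-up query: count the books of a single address
// and keep the M most frequent ones.
function booksPipelineFor(addr, m) {
  return [
    { $match: { addr: addr } },
    { $group: { _id: '$book', count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: m },
  ];
}

// Assumed here: the first query returned these two addresses.
const followUps = ['address1', 'address2'].map((addr) => booksPipelineFor(addr, 2));
console.log(JSON.stringify(followUps[0]));
```

Each pipeline can then be passed to db.books.aggregate(...). Since every follow-up starts with a plain $match on addr, it can make use of an index on that field if one exists.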


Original Content

You seem to have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve, though not with the exact limiting that you ask for:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": {
            "$push": {
                "book": "$_id.book",
                "count": "$bookCount"
            }
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

Now that will give you a result like this:

{
    "result" : [
        {
            "_id" : "address1",
            "books" : [
                { "book" : "book4", "count" : 1 },
                { "book" : "book5", "count" : 1 },
                { "book" : "book1", "count" : 3 }
            ],
            "count" : 5
        },
        {
            "_id" : "address2",
            "books" : [
                { "book" : "book5", "count" : 1 },
                { "book" : "book1", "count" : 2 }
            ],
            "count" : 3
        }
    ],
    "ok" : 1
}

So this differs from what you are asking in that, while we do get the top results for the address values, the underlying "books" selection is not limited to only the required number of results.

This turns out to be very difficult to do. It can be done, but the complexity increases with the number of items you need to match. To keep it simple, we can limit this to 2 matches at most:

db.books.aggregate([
    // Count each (addr, book) pair, then collect the books per address
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": {
            "$push": {
                "book": "$_id.book",
                "count": "$bookCount"
            }
        },
        "count": { "$sum": "$bookCount" }
    }},
    // Top 2 addresses
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    // Re-order the books of each address by count
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    // Keep the whole document in _id and take the first book as num1
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
        "_id": "$_id",
        "num1": { "$first": "$newBooks" }
    }},
    // Unwind again, drop the book already seen, take the next one as num2
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group": {
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    // Turn num1 and num2 back into a "books" array
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false], 0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

So that will actually give you the top 2 "books" from the top two "address" entries.

But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.
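A minimal sketch of that client-side "slice", assuming the result has already been fetched into a plain array. The `aggregated` data copies the sample result shown earlier in this answer, and `keepTopBooks` is a made-up helper name. Sorting before slicing matters because the pushed array is not ordered by count.

```javascript
// Documents as returned by the first pipeline (sample values from above).
const aggregated = [
  {
    _id: 'address1',
    books: [
      { book: 'book4', count: 1 },
      { book: 'book5', count: 1 },
      { book: 'book1', count: 3 },
    ],
    count: 5,
  },
  {
    _id: 'address2',
    books: [
      { book: 'book5', count: 1 },
      { book: 'book1', count: 2 },
    ],
    count: 3,
  },
];

// For every address, order its books by count (descending) and keep the
// first `m`. The input array is copied, not modified.
function keepTopBooks(results, m) {
  return results.map((doc) => ({
    ...doc,
    books: [...doc.books].sort((a, b) => b.count - a.count).slice(0, m),
  }));
}

const trimmed = keepTopBooks(aggregated, 2);
console.log(JSON.stringify(trimmed, null, 2));
```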



Answer 2:

Use the aggregate function as below:

[
    { $group: { _id: { book: '$book', address: '$addr' }, total: { $sum: 1 } } },
    { $project: { book: '$_id.book', address: '$_id.address', total: '$total', _id: 0 } }
]

It will give you a result like the following:

{ "total" : 1, "book" : "book33", "address" : "address90" },
{ "total" : 1, "book" : "book5", "address" : "address1" },
{ "total" : 1, "book" : "book99", "address" : "address9" },
{ "total" : 1, "book" : "book1", "address" : "address5" },
{ "total" : 1, "book" : "book5", "address" : "address2" },
{ "total" : 1, "book" : "book3", "address" : "address4" },
{ "total" : 1, "book" : "book11", "address" : "address77" },
{ "total" : 1, "book" : "book9", "address" : "address3" },
{ "total" : 1, "book" : "book1", "address" : "address15" },
{ "total" : 2, "book" : "book1", "address" : "address2" },
{ "total" : 3, "book" : "book1", "address" : "address1" }

I didn't quite get your expected result format, so feel free to modify this to one you need.
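One possible way to reshape these flat rows into the per-address layout from the question is sketched below in plain JavaScript. The helper name `nestByAddress` and the output field names are assumptions for illustration; `rows` is a small excerpt of the result above.

```javascript
// A few of the flat rows produced by the pipeline above.
const rows = [
  { total: 1, book: 'book5', address: 'address1' },
  { total: 1, book: 'book5', address: 'address2' },
  { total: 2, book: 'book1', address: 'address2' },
  { total: 3, book: 'book1', address: 'address1' },
];

// Group rows by address, collecting each book with its count and
// summing a per-address total.
function nestByAddress(flatRows) {
  const byAddress = new Map();
  for (const { address, book, total } of flatRows) {
    if (!byAddress.has(address)) {
      byAddress.set(address, { address, books: [], total: 0 });
    }
    const entry = byAddress.get(address);
    entry.books.push({ book, count: total });
    entry.total += total;
  }
  const nested = [...byAddress.values()];
  // Most frequent books first within each address...
  for (const entry of nested) {
    entry.books.sort((a, b) => b.count - a.count);
  }
  // ...and busiest addresses first overall.
  return nested.sort((a, b) => b.total - a.total);
}

const nestedRows = nestByAddress(rows);
console.log(JSON.stringify(nestedRows, null, 2));
```

Taking the first N entries of the outer array and the first M entries of each `books` array then gives the "top N addresses, top M books" view.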


