Partition data around a match query during aggregation

你。 提交于 2021-02-05 11:18:30

问题


What I have been trying to get my head around is to perform some kind of partitioning(split by predicate) in a mongo query. My current query looks like:

db.posts.aggregate([
 {"$match": { $and:[ {$or:[{"toggled":false},{"toggled":true, "status":"INACTIVE"}]}  ,  {"updatedAt":{$gte:1549786260000}} ] }},
 {"$unwind" :"$interests"},
 {"$group" : {"_id": {"iid": "$interests", "pid":"$publisher"}, "count": {"$sum" : 1}}},
 {"$project":{ _id: 0, "iid": "$_id.iid", "pid": "$_id.pid", "count": 1 }}
])

This results in the following output:

{
    "count" : 3.0,
    "iid" : "INT456",
    "pid" : "P789"
}
{
    "count" : 2.0,
    "iid" : "INT789",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P123"
}

All good so far, but then I had realized that for the documents that match the specific filter {"toggled":true, "status":"INACTIVE"}, I would rather decrement the count (-1). (considering the eventual value can be negative as well.)

Is there a way to somehow partition the data after match to make sure different grouping operations are performed for both the collection of documents?

Something that sounds similar to what I am looking for is $mergeObjects, or maybe $reduce, but not much that I can relate from the documentation examples.

Note: I can sense, one straightforward way to deal with this would be to perform two queries, but I am looking for a single query to perform the operation.


Sample documents for the above output would be:

/* 1 */
{
    "_id" : ObjectId("5d1f7******"),
    "id" : "CON123",
    "title" : "Game",
    "content" : {},
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P789",
    "interests" : [ 
        "INT456"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 2 */
{
    "_id" : ObjectId("5d1f8******"),
    "id" : "CON456",
    "title" : "Home",
    "content" : {},
    "status" : "INACTIVE",
    "toggle":true,
    "publisher" : "P789",
    "interests" : [ 
        "INT456",
        "INT789"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 3 */
{
    "_id" : ObjectId("5d0e9******"),
    "id" : "CON654",
    "title" : "School",
    "content" : {},
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P789",
    "interests" : [ 
        "INT123",
        "INT456",
        "INT789"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 4 */
{
    "_id" : ObjectId("5d207*******"),
    "id" : "CON789",
    "title":"Stack",
    "content" : { },
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P123",
    "interests" : [ 
        "INT123"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

What I am looking forward to as a result though is

{
    "count" : 1.0, (2-1)
    "iid" : "INT456",
    "pid" : "P789"
}
{
    "count" : 0.0, (1-1)
    "iid" : "INT789",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P123"
}

回答1:


This aggregation gives the desired result.

db.posts.aggregate( [
{ $match:  { updatedAt: { $gte: 1549786260000 } } },
{ $facet: {
        FALSE: [
            { $match: { toggle: false } },
            { $unwind : "$interests" },
            { $group : { _id : { iid: "$interests", pid: "$publisher" }, count: { $sum : 1 } } },
        ],
        TRUE: [
            { $match: { toggle: true, status: "INACTIVE" } },
            { $unwind : "$interests" },
            { $group : { _id : { iid: "$interests", pid: "$publisher" }, count: { $sum : -1 } } },
        ]
} },
{ $project: { result: { $concatArrays: [ "$FALSE", "$TRUE" ] } } },
{ $unwind: "$result" },
{ $replaceRoot: { newRoot: "$result" } },
{ $group : { _id : "$_id", count: { $sum : "$count" } } },
{ $project:{ _id: 0, iid: "$_id.iid", pid: "$_id.pid", count: 1 } }
] )


[ EDIT ADD ]

The output from the query using the input data from the question post:

{ "count" : 1, "iid" : "INT123", "pid" : "P789" }
{ "count" : 1, "iid" : "INT123", "pid" : "P123" }
{ "count" : 0, "iid" : "INT789", "pid" : "P789" }
{ "count" : 1, "iid" : "INT456", "pid" : "P789" }



[ EDIT ADD 2 ]

This query gets the same result with different approach (code):

db.posts.aggregate( [
  { 
      $match:  { updatedAt: { $gte: 1549786260000 } } 
  },
  { 
      $unwind : "$interests" 
  },
  { 
      $group : { 
          _id : { 
              iid: "$interests", 
              pid: "$publisher" 
          }, 
          count: { 
              $sum: {
                  $switch: {
                      branches: [
                        { case: { $eq: [ "$toggle", false ] },
                           then: 1 },
                        { case: { $and: [ { $eq: [ "$toggle", true] },  { $eq: [ "$status", "INACTIVE" ] } ] },
                           then: -1 }
                      ]
                  }          
              } 
          }
      } 
  },
  { 
      $project:{
           _id: 0, 
           iid: "$_id.iid", 
           pid: "$_id.pid", 
           count: 1 
      } 
  }
] )


[ EDIT ADD 3 ]

NOTE:

The facet query runs the two facets (TRUE and FALSE) on the same set of documents; it is like two queries running in parallel. But, there is some duplication of code as well as additional stages for shaping the documents down the pipeline to get the desired output.

The second query avoids the code duplication, and there are much lesser stages in the aggregation pipeline. This will make difference when the input dataset has a large number of documents to process - in terms of performance. In general, lesser stages means lesser iterations of the documents (as a stage has to scan the documents which are output from the previous stage).



来源:https://stackoverflow.com/questions/60285951/partition-data-around-a-match-query-during-aggregation

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!