MongoDB aggregation pipeline size and speed issue

Submitted by 淺唱寂寞╮ on 2019-12-22 08:10:05

Question


I'm trying to use a MongoDB aggregation query to join ($lookup) two collections and then count the distinct values across all the unique fields in the joined array. *Note: I don't necessarily know which fields (keys) are in the metaDataMap object, and I don't want to count or include fields that may or may not exist in the map. That's why the aggregation query looks the way it does.

So my two collections look like this. Events:

{
  "_id" : "1",
  "name" : "event1",
  "objectsIds" : [ "1", "2", "3" ]
}

Objects:

{
  "_id" : "1",
  "name" : "object1",
  "metaDataMap" : {
    "SOURCE" : ["ABC", "DEF"],
    "DESTINATION" : ["XYZ", "PDQ"],
    "TYPE" : []
  }
},
{
  "_id" : "2",
  "name" : "object2",
  "metaDataMap" : {
    "SOURCE" : ["RST", "LNE"],
    "TYPE" : ["text"]
  }
},
{
  "_id" : "3",
  "name" : "object3",
  "metaDataMap" : {
    "SOURCE" : ["NOP"],
    "DESTINATION" : ["PHI", "NYC"],
    "TYPE" : ["video"]
  }
}

The results I want are:

{ "_id" : "SOURCE", "count" : 5 }
{ "_id" : "DESTINATION", "count" : 4 }
{ "_id" : "TYPE", "count" : 2 }

What I have so far is this:

db.events.aggregate([
  {$match: {"_id": id}}

  ,{$lookup: {"from": "objects",
              "localField": "objectsIds",
              "foreignField": "_id",
              "as": "objectResults"}}

  ,{$unwind: "$objectResults"}                    // Line 1
  ,{$project: {x: "$objectResults.metaDataMap"}}  // Line 2

  ,{$unwind: "$x"}
  ,{$project: {"_id": 0}}

  ,{$project: {x: {$objectToArray: "$x"}}}
  ,{$unwind: "$x"}

  ,{$group: {_id: "$x.k", tmp: {$push: "$x.v"}}}

  // Flatten the pushed array-of-arrays into one array per key
  ,{$addFields: {tmp: {$reduce: {
      input: "$tmp",
      initialValue: [],
      in: {$concatArrays: ["$$value", "$$this"]}
  }}}}

  ,{$unwind: "$tmp"}
  ,{$group: {_id: "$_id", uniqueVals: {$addToSet: "$tmp"}}}

  ,{$addFields: {count: {$size: "$uniqueVals"}}}
  ,{$project: {_id: "$_id", count: "$count"}}
]);
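The $reduce/$concatArrays stage above flattens the array-of-arrays that $push produces. In plain JavaScript it is equivalent to (sample values borrowed from the SOURCE key):

```javascript
// $reduce with $concatArrays turns [["a","b"],["c"]] into ["a","b","c"].
const tmp = [["ABC", "DEF"], ["RST", "LNE"], ["NOP"]];
const flattened = tmp.reduce((value, current) => value.concat(current), []);
// flattened → ["ABC", "DEF", "RST", "LNE", "NOP"]
```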

My issue is where I marked Line 1 & 2. The above works, but takes around 50 seconds when there are 25,000 values in the metaDataMap array fields (objectResults.metaDataMap), for example 25,000 values in object 1's metaDataMap SOURCE array. That's way too slow. My other, faster approach was to replace Line 1 & 2 with:

 ,{$project: {x: "$objectResults.metaDataMap"}} //Line 1
 ,{$unwind: "$x"} //Line 2

This is way faster (under 3 seconds) but can only be run on datasets that have ~10,000 items or fewer. Anything higher and I get an error saying "exceeds maximum document size".
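For context on why the faster ordering fails: with $project before $unwind, every joined metaDataMap stays inside one intermediate document, and any single document flowing through the pipeline must stay under MongoDB's 16 MB BSON limit. A back-of-the-envelope sketch (the 2 KB per-item figure is an assumption for illustration, not measured from the real data):

```javascript
// Assume each joined object contributes roughly 2 KB of metaDataMap values.
const itemBytes = 2 * 1024;
const bsonLimit = 16 * 1024 * 1024; // MongoDB's max BSON document size (16 MB)

// With $project before $unwind, all joined metaDataMaps sit in ONE
// intermediate document, so its size grows linearly with item count:
const sizeForN = (n) => n * itemBytes;

sizeForN(5000) < bsonLimit;   // a few thousand items still fit
sizeForN(10000) > bsonLimit;  // ~10,000 items of this size blow past 16 MB
```

Unwinding first (the slow ordering) splits the join result into many small documents, which is why it never hits the limit but pays in time.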

Please help!


Answer 1:


If you're able to alter your schema design on the objects collection to include a parent_id field, you can immediately remove the first 4 stages of your pipeline (the first $match, $lookup, $unwind, and $project). This makes the concern about Line 1 and Line 2 disappear.

For example, a document in the objects collection would look like:

{
  "_id": "1",
  "name": "object1",
  "metaDataMap": {
    "SOURCE": [
      "ABC",
      "DEF"
    ],
    "DESTINATION": [
      "XYZ",
      "PDQ"
    ],
    "TYPE": [ ]
  },
  "parent_id": "1"
}

Thus you don't need the expensive $lookup and $unwind. The first 4 stages can then be replaced with:

{$match: {parent_id: id}}
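Populating parent_id is a one-time backfill driven by the events collection. The logic, sketched here over plain JavaScript arrays standing in for the two collections (in the real shell this would be a cursor loop issuing updateMany calls):

```javascript
// In-memory stand-ins for the two collections (shapes from the question).
const events = [
  { _id: "1", name: "event1", objectsIds: ["1", "2", "3"] }
];
const objects = [
  { _id: "1", name: "object1" },
  { _id: "2", name: "object2" },
  { _id: "3", name: "object3" }
];

// For each event, stamp its _id onto every object it references.
for (const ev of events) {
  for (const obj of objects) {
    if (ev.objectsIds.includes(obj._id)) {
      obj.parent_id = ev._id;
    }
  }
}
// Every sample object now carries parent_id === "1".
```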

Based on this idea, I did further optimization of the pipeline, which resulted in:

db.objects.aggregate([
     {$match: {parent_id: id}}
    ,{$project: {metaDataMap: {$filter: {input: {$objectToArray: '$metaDataMap'}, cond: {$ne: [[], '$$this.v']}}}}}
    ,{$unwind: '$metaDataMap'}
    ,{$unwind: '$metaDataMap.v'}
    ,{$group: {_id: '$metaDataMap.k', val: {$addToSet: '$metaDataMap.v'}}}
    ,{$project: {count: {$size: '$val'}}}
])

This will output:

{ "_id": "TYPE", "count": 2 }
{ "_id": "DESTINATION", "count": 4 }
{ "_id": "SOURCE", "count": 5 }
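As a sanity check, the grouping and dedup logic of this pipeline can be reproduced over the sample documents in plain JavaScript:

```javascript
// The three sample objects from the question.
const objects = [
  { metaDataMap: { SOURCE: ["ABC", "DEF"], DESTINATION: ["XYZ", "PDQ"], TYPE: [] } },
  { metaDataMap: { SOURCE: ["RST", "LNE"], TYPE: ["text"] } },
  { metaDataMap: { SOURCE: ["NOP"], DESTINATION: ["PHI", "NYC"], TYPE: ["video"] } }
];

// $objectToArray + $unwind + $group with $addToSet, with the
// $filter stage dropping empty arrays up front.
const uniques = {};
for (const doc of objects) {
  for (const [k, v] of Object.entries(doc.metaDataMap)) {
    if (v.length === 0) continue;             // $filter: skip empty-array entries
    if (!uniques[k]) uniques[k] = new Set();
    for (const val of v) uniques[k].add(val); // $addToSet: dedupe values
  }
}

// $project: {count: {$size: '$val'}}
const counts = {};
for (const k of Object.keys(uniques)) counts[k] = uniques[k].size;
// counts.SOURCE === 5, counts.DESTINATION === 4, counts.TYPE === 2
```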


Source: https://stackoverflow.com/questions/46692892/mongodb-aggregation-pipeline-size-and-speed-issue
