Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size

Submitted anonymously (unverified) on 2019-12-03 08:33:39

Question:

I have a pretty simple $lookup aggregation query like the following:

{ '$lookup': {
    'from': 'edge',
    'localField': 'gid',
    'foreignField': 'to',
    'as': 'from'
} }

When I run this on a match with enough documents I get the following error:

Command failed with error 4568: 'Total size of documents in edge matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" } }, {} ] } } exceeds maximum document size' on server 

All attempts to limit the number of documents fail: allowDiskUse: true does nothing, passing in a cursor does nothing, and adding a $limit to the aggregation also fails.

How could this be?

Then I see the error again. Where did that $match, $and, and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no way to limit or use cursors with?

What is going on here?

Answer 1:

As stated earlier in the comments, the error occurs because $lookup, which by default produces a target "array" within the parent document from the results of the foreign collection, selects documents whose total size causes the parent to exceed the 16MB BSON limit.
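To put numbers on it, here is the rough arithmetic for the demo data used in the listing later in this answer (1,000 matching documents, each padded with about 100KB of data):

```javascript
// Rough size arithmetic for the demo data used in the listing below:
// 1000 matching 'edge' documents, each padded with "x".repeat(100000).
const docs = 1000;
const bytesPerDoc = 100000;                    // ~100KB of padding per document
const joinedArrayBytes = docs * bytesPerDoc;   // size of the "as" array alone
const bsonLimit = 16 * 1024 * 1024;            // 16MB BSON document limit

console.log(joinedArrayBytes);                 // 100000000 (~100MB)
console.log(joinedArrayBytes > bsonLimit);     // true: the parent document cannot hold it
```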

The counter to this is to add an $unwind stage immediately following the $lookup pipeline stage. This actually alters the behavior of $lookup so that instead of producing an array in the parent, the results are instead a "copy" of each parent for every document matched.

Pretty much just like regular usage of $unwind, with the exception that instead of processing as a "separate" pipeline stage, the unwinding action is added to the $lookup operation itself. Ideally you also follow the $unwind with a $match condition, which creates a matching argument that is likewise added into the $lookup. You can actually see this in the explain output for the pipeline.
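In outline, the shape of the fix looks like this. This is a sketch using the collection and field names from the question; the $match field and value are hypothetical placeholders:

```javascript
// A $lookup immediately followed by $unwind on its "as" field lets the
// optimizer coalesce the two stages, so joined documents stream through
// one at a time instead of being gathered into a single oversized array.
const pipeline = [
  { $lookup: {
    from: 'edge',          // foreign collection, as in the question
    localField: 'gid',
    foreignField: 'to',
    as: 'from'
  }},
  // Unwinding the "as" field is what triggers the coalescence
  { $unwind: '$from' },
  // A $match on the "as" field directly after is also folded into $lookup;
  // the field and value here are hypothetical
  { $match: { 'from.type': 'gene' } }
];

console.log(pipeline.length);  // 3 stages written, but the server executes one coalesced $lookup
```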

The topic is actually covered (briefly) in a section of Aggregation Pipeline Optimization in the core documentation:

$lookup + $unwind Coalescence

New in version 3.2.

When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.

Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:

const MongoClient = require('mongodb').MongoClient;

const uri = 'mongodb://localhost/test';

function data(data) {
  console.log(JSON.stringify(data, undefined, 2))
}

(async function() {

  let db;

  try {
    db = await MongoClient.connect(uri);

    console.log('Cleaning....');
    // Clean data
    await Promise.all(
      ["source","edge"].map(c => db.collection(c).remove() )
    );

    console.log('Inserting...')

    await db.collection('edge').insertMany(
      Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
    );
    await db.collection('source').insert({ _id: 1 })

    console.log('Fattening up....');
    await db.collection('edge').updateMany(
      {},
      { $set: { data: "x".repeat(100000) } }
    );

    // The full pipeline. Failing test uses only the $lookup stage
    let pipeline = [
      { $lookup: {
        from: 'edge',
        localField: '_id',
        foreignField: 'gid',
        as: 'results'
      }},
      { $unwind: '$results' },
      { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
      { $project: { 'results.data': 0 } },
      { $group: { _id: '$_id', results: { $push: '$results' } } }
    ];

    // List and iterate each test case
    let tests = [
      'Failing.. Size exceeded...',
      'Working.. Applied $unwind...',
      'Explain output...'
    ];

    for (let [idx, test] of Object.entries(tests)) {
      console.log(test);

      try {
        let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
            options = (( +idx === tests.length-1 ) ? { explain: true } : {});

        await new Promise((end,error) => {
          let cursor = db.collection('source').aggregate(currpipe,options);
          for ( let [key, value] of Object.entries({ error, end, data }) )
            cursor.on(key,value);
        });
      } catch(e) {
        console.error(e);
      }

    }

  } catch(e) {
    console.error(e);
  } finally {
    db.close();
  }

})();

After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup which will fail with the following error:

{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size

Which is basically telling you the BSON limit was exceeded on retrieval.

By contrast, the next attempt adds the $unwind and $match pipeline stages.

The Explain output:

  {
    "$lookup": {
      "from": "edge",
      "as": "results",
      "localField": "_id",
      "foreignField": "gid",
      "unwinding": {                        // $unwind now is unwinding
        "preserveNullAndEmptyArrays": false
      },
      "matching": {                         // $match now is matching
        "$and": [                           // and actually executed against
          {                                 // the foreign collection
            "_id": {
              "$gte": 1
            }
          },
          {
            "_id": {
              "$lte": 5
            }
          }
        ]
      }
    }
  },
  // $unwind and $match stages removed
  {
    "$project": {
      "results": {
        "data": false
      }
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "results": {
        "$push": "$results"
      }
    }
  }

And that result of course succeeds: because the results are no longer being placed into the parent document, the BSON limit cannot be exceeded.

This really happens just as a result of adding $unwind alone, but the $match is included to show that it, too, is folded into the $lookup stage. The overall effect is to "limit" the results returned in an effective way, since everything is done inside that $lookup operation and no results other than those matching are actually returned.

By constructing the pipeline this way, you can query for "referenced data" that would otherwise exceed the BSON limit, and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" that $lookup actually performs.


