Find largest document size in MongoDB

耶瑟儿~ 2020-12-01 01:33

Is it possible to find the largest document size in MongoDB?

db.collection.stats() shows average size, which is not really representative because in my

7 answers
  •  忘掉有多难
    2020-12-01 02:02

    Using the aggregation framework and a little knowledge about the documents in the collection, you can find the largest documents roughly 100x faster than with the approaches in the other answers. You'll get the results in seconds rather than the minutes it takes with forEach or, worse, with fetching every document to the client.

    You need to know which field(s) in your document might be the largest ones, which you almost always will. There are only two practical[1] MongoDB types that can have variable sizes:

    • arrays
    • strings

    The aggregation framework can calculate the length of each. Note that for arrays you get the length in elements, not the size in bytes. Typically, though, what matters more is identifying the outlier documents, not exactly how many bytes they take.

    Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):

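    db.users.aggregate([
        // Keep only documents where friends.ids is an array; $size raises an error on anything else
        { $match: {
            'friends.ids': { $type: 'array' }
        }},
        // Compute the number of elements in the array
        { $project: {
            sizeLargestField: { $size: '$friends.ids' }
        }},
        // Largest arrays first
        { $sort: {
            sizeLargestField: -1
        }},
    ])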
    

    The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:

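    db.users.aggregate([
        // Keep only documents where bio is a string; $strLenBytes raises an error on null or non-string values
        { $match: {
            bio: { $type: 'string' }
        }},
        // Compute the length of the string in UTF-8 bytes
        { $project: {
            sizeLargestField: { $strLenBytes: '$bio' }
        }},
        // Longest strings first
        { $sort: {
            sizeLargestField: -1
        }},
    ])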
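
    To make the unit concrete: $strLenBytes counts UTF-8 encoded bytes, not characters. The sketch below is only a local Node.js illustration of that difference and does not touch MongoDB. The helper utf8ByteLength and the sample strings are hypothetical names introduced for this example.

```javascript
// Local illustration only; no MongoDB connection is involved.
// Buffer.byteLength with 'utf8' counts encoded bytes, the same unit $strLenBytes reports.
function utf8ByteLength(text) {
  return Buffer.byteLength(text, 'utf8');
}

const asciiBio = 'hello';
const accentedBio = 'h\u00e9llo'; // U+00E9 (e with acute accent) takes two bytes in UTF-8

// Character count (UTF-16 code units) next to UTF-8 byte count
console.log(asciiBio.length, utf8ByteLength(asciiBio));       // prints "5 5"
console.log(accentedBio.length, utf8ByteLength(accentedBio)); // prints "5 6"
```

    For text that is mostly non-ASCII, the byte count can be several times the character count, which is why the byte length is the better proxy for storage size.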
    

    You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
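
    A minimal sketch of such a combined pipeline, assuming the friends.ids and bio fields from the examples above. The helper buildSizePipeline, the output field sizeLargestFields, and the limit of 10 are hypothetical choices made for this illustration. Note that the sum mixes array elements with string bytes, so it is only a rough ranking score, not a byte count.

```javascript
// Hypothetical helper (not part of MongoDB): builds a pipeline that ranks
// documents by the combined length of the given array and string fields.
function buildSizePipeline(arrayFields, stringFields, limit) {
  // Array fields contribute their element count; anything that is not an array counts as 0
  const arrayTerms = arrayFields.map((field) => ({
    $cond: [{ $isArray: ['$' + field] }, { $size: '$' + field }, 0],
  }));
  // String fields contribute their UTF-8 byte length; missing or non-string values count as 0
  const stringTerms = stringFields.map((field) => ({
    $cond: [
      { $eq: [{ $type: '$' + field }, 'string'] },
      { $strLenBytes: '$' + field },
      0,
    ],
  }));
  return [
    { $project: { sizeLargestFields: { $sum: arrayTerms.concat(stringTerms) } } },
    { $sort: { sizeLargestFields: -1 } }, // largest combined score first
    { $limit: limit },                    // keep only the top outliers
  ];
}

const pipeline = buildSizePipeline(['friends.ids'], ['bio'], 10);
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell, the result could be passed along as: db.users.aggregate(pipeline)
```

    The $cond guards are a design choice for this sketch: they let the pipeline run without a $match stage, so documents missing one of the fields still get a score from the others.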


    [1] Technically, the rarely used binData type can also have variable size.
