Fragmentation in MongoDB when growing documents

Submitted by 五迷三道 on 2020-01-23 17:47:05

Question


Seems like a blog with comments is the standard example used for describing different modeling strategies when using MongoDB.

My question relates to the model where comments are modeled as a sub-collection on a single blog post document (i.e. one document stores everything related to a single blog post).

In the case of multiple simultaneous writes, it seems you would avoid overwriting previous updates by using upserts and targeted update modifiers (like $push): saving the document for every comment added would then not overwrite previously added comments. However, how does fragmentation come into play here? Is it realistic to assume that adding multiple comments over time will result in fragmented memory and potentially slower queries? Are there any guidelines for growing a document through sub-collections?
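For concreteness, a targeted append of a single comment could look like the sketch below (the posts collection, comments field, and postId variable are assumed names, not taken from the original question):

// append one comment atomically; only the comments array is touched, so
// concurrent writers adding other comments are not overwritten
db.posts.update(
    { _id: postId },
    { $push: { comments: { author: "alice", text: "Nice post!", created: new Date() } } },
    { upsert: true }
);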

I am also aware of the 16MB limit per document, but that to me seems like a theoretical limit since 16 MB would be an enormous amount of text. In the event of fragmentation, would the documents be compacted the next time the mongo instance is restarted and reads the database back into memory?
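As an aside, the shell can report a document's BSON size, which makes it easy to check how far a real post actually is from that limit (collection name assumed as above):

Object.bsonsize(db.posts.findOne({ _id: postId }))   // size in bytes; the hard cap is 16 MB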

I know that the way you expect to interact with the data is the best guiding principle for how to model it (e.g. needing comments without the blog post parent). However, I am interested in learning about potential issues with the highly denormalized single-document approach. Are the issues I'm describing even realistic in the given blog post example?


Answer 1:


Before answering your questions, let me roughly explain how MongoDB stores data on disk (this describes the MMAPv1 storage engine).

  • For a database named test, you will see files like test.0, test.1, ..., so DATABASE = [FILE, ...]
  • FILE = [EXTENT, ...]
  • EXTENT = [RECORD, ...]
  • RECORD = HEADER + DOCUMENT + PADDING
  • HEADER = SIZE + OFFSET + PREV_RECORD_POINTER + NEXT_RECORD_POINTER + FLAG + ...

This link is for your reference.
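If you want to inspect these structures for your own collection, the shell's stats output exposes some of them under the MMAPv1 engine; for example:

db.a.stats()   // look at numExtents, storageSize and paddingFactor to see extents and padding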

Now I will try to answer your questions as best I can.

  1. How does fragmentation come into play?
    It happens when the current record is too small to hold the updated document. MongoDB then migrates the document: the updated version is written into a new, large-enough space and the original record is deleted. The deleted record becomes a fragment.

  2. Will it result in fragmented memory and potentially slower queries?
    Fragmented memory will occur, but it won't cause slower queries by itself unless there is eventually not enough space left to allocate.

However, a deleted record can be reused if a newly arriving document fits into it. Below is a simple demonstration
(pay attention to the offset field).

> db.a.insert([{_id:1},{_id:2},{_id:3}]);
BulkWriteResult({
        "writeErrors" : [ ],
        "writeConcernErrors" : [ ],
        "nInserted" : 3,
        "nUpserted" : 0,
        "nMatched" : 0,
        "nModified" : 0,
        "nRemoved" : 0,
        "upserted" : [ ]
})
> db.a.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
> db.a.find().showDiskLoc()
{ "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
{ "_id" : 2, "$diskLoc" : { "file" : 0, "offset" : 106736 } }   // the following operation will delete this document
{ "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
> db.a.update({_id:2},{$set:{arr:[1,2,3]}});
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.a.find().showDiskLoc()
{ "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
{ "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
{ "_id" : 2, "arr" : [ 1, 2, 3 ], "$diskLoc" : { "file" : 0, "offset" : 106864 } }  // migration happened
> db.a.insert({_id:4});
WriteResult({ "nInserted" : 1 })
> db.a.find().showDiskLoc()
{ "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
{ "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
{ "_id" : 2, "arr" : [ 1, 2, 3 ], "$diskLoc" : { "file" : 0, "offset" : 106864 } }
{ "_id" : 4, "$diskLoc" : { "file" : 0, "offset" : 106736 } }   // this space was taken up by {_id:2}, reused now.
>



Answer 2:


In addition, you should read this article by Asya Kamsky; it will help you make a decision: http://askasya.com/post/largeembeddedarrays

The most obvious problem with this is eventually you'll hit the 16MB document limit, but that's not at all what you should be concerned about. A document that continuously grows will incur higher and higher cost every time it has to get relocated on disk, and even if you take steps to mitigate the effects of fragmentation, your writes will overall be unnecessarily long, impacting overall performance of your entire application.
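By way of illustration, two common mitigation steps on the MMAPv1 engine were to round record allocations to powers of two (so freed space is easier to reuse) and to run an offline compaction. A rough sketch, with the collection name assumed:

// round record sizes up to powers of 2 so deleted records can be reused more often
// (this allocation strategy became the default in MongoDB 2.6)
db.runCommand({ collMod: "posts", usePowerOf2Sizes: true })

// rewrite and defragment the collection; this blocks operations on its database while it runs
db.runCommand({ compact: "posts" })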



Source: https://stackoverflow.com/questions/26029099/fragmentation-in-mongodb-when-growing-documents
