Background
I'm prototyping a conversion from our RDBMS database to MongoDB. While denormalizing, it seems as if I have two choices, one which leads to
Documents that grow substantially over time can be ticking time bombs. Network bandwidth and RAM usage will likely become measurable bottlenecks, forcing you to start over.
First, let's consider two collections: Customer and Payment. With this design, the grain is fairly small: one document per payment.
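As a rough sketch of that grain, the two collections can be pictured as plain Python dictionaries standing in for MongoDB documents. Every field name here (customer_id, amount_cents, and so on) is an assumption made for illustration, and the lists merely stand in for real collections.

```python
# Hypothetical documents; all field names are illustrative assumptions.
customers = [
    {"_id": "cust-1", "name": "Ada Example"},
]

# One document per payment, each pointing back to its customer by _id.
payments = [
    {"_id": "pay-1", "customer_id": "cust-1", "amount_cents": 2500},
    {"_id": "pay-2", "customer_id": "cust-1", "amount_cents": 1000},
]


def payments_for(customer_id):
    """Return the payment documents that reference the given customer."""
    return [p for p in payments if p["customer_id"] == customer_id]


# The customer document never grows; new payments are new documents.
total = sum(p["amount_cents"] for p in payments_for("cust-1"))
```

Adding a payment appends a new small document and leaves the customer document untouched.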
Next you must decide how to model account information, such as credit cards. Let's consider whether customer documents contain arrays of account information or whether you need a new Account collection.
If account documents are separate from customer documents, loading all of the accounts for one customer into memory requires fetching multiple documents. That might translate into extra memory, I/O, bandwidth, and CPU usage. Does that immediately mean the Account collection is a bad idea?
Your decision affects payment documents. If account information is embedded in a customer document, how would you reference it? Separate account documents have their own _id attribute. With embedded account information, your application would either generate new ids for accounts or use one of the account's attributes (e.g., account number) as the key.
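The two ways of referencing an account from a payment might look like the sketch below. It uses plain Python dictionaries in place of MongoDB documents, and the field names (accounts, number, account_id) and both helper functions are assumptions invented for the example.

```python
# Option A: accounts embedded in the customer document.
# Embedded entries have no _id of their own, so the application supplies
# the key; here a hypothetical account number plays that role.
customer_embedded = {
    "_id": "cust-1",
    "accounts": [
        {"number": "4111-0001", "type": "credit_card"},
        {"number": "4111-0002", "type": "credit_card"},
    ],
}
payment_a = {"_id": "pay-1", "customer_id": "cust-1", "account_number": "4111-0002"}


def resolve_embedded(payment, customer):
    """Find the embedded account by scanning the customer's array."""
    for account in customer["accounts"]:
        if account["number"] == payment["account_number"]:
            return account
    return None


# Option B: accounts in their own collection, referenced by _id only.
accounts_by_id = {
    "acct-1": {"_id": "acct-1", "customer_id": "cust-1", "number": "4111-0001"},
    "acct-2": {"_id": "acct-2", "customer_id": "cust-1", "number": "4111-0002"},
}
payment_b = {"_id": "pay-1", "customer_id": "cust-1", "account_id": "acct-2"}


def resolve_separate(payment, accounts):
    """Look the account up directly by its _id."""
    return accounts.get(payment["account_id"])
```

In option A the payment depends on an application-defined key plus the customer's _id; in option B a single _id is enough to find the account.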
Could a payment document actually contain all the payments made in a fixed timeframe (e.g., a day)? Such complexity will affect all code that reads and writes payment documents. Premature optimization can be deadly to projects.
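To make the added complexity concrete, here is a minimal sketch comparing a read against one-document-per-payment with a read against a per-day bucket. The dictionaries stand in for collections, and the bucket layout (one document per date holding a payments array) is only one assumed way such a design could look.

```python
# One document per payment: a read is a direct lookup by _id.
payments_by_id = {
    "pay-1": {"_id": "pay-1", "date": "2024-01-05", "amount_cents": 2500},
}


def find_payment_flat(payment_id):
    return payments_by_id.get(payment_id)


# One document per day (hypothetical layout): the payment is nested in an
# array, so the caller must know the date and then search inside the bucket.
daily_buckets = {
    "2024-01-05": {
        "_id": "2024-01-05",
        "payments": [{"payment_id": "pay-1", "amount_cents": 2500}],
    },
}


def find_payment_bucketed(date, payment_id):
    bucket = daily_buckets.get(date)
    if bucket is None:
        return None
    for payment in bucket["payments"]:
        if payment["payment_id"] == payment_id:
            return payment
    return None
```

Every reader and writer of payments has to carry the extra date argument and the inner search, which is the cost the bucketed design imposes on the rest of the code.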
Like account documents, payments are easily referenced as long as a payment document contains only one payment. A new type of document, credit for example, could reference a payment. But would you create a Credit collection or would you embed credit information inside payment information? What would happen if you later needed to reference a credit?
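A separate Credit collection could be sketched as follows, again with dictionaries standing in for collections. The field names and the "adjustment" document type are hypothetical; the adjustment exists only to show a later document referencing a credit by _id.

```python
payments = {
    "pay-1": {"_id": "pay-1", "amount_cents": 2500},
}

# Each credit is its own document and references its payment by _id.
credits = {
    "cred-1": {"_id": "cred-1", "payment_id": "pay-1", "amount_cents": 500},
}

# A later, hypothetical document type can reference the credit the same
# way, because the credit has its own _id.
adjustments = {
    "adj-1": {"_id": "adj-1", "credit_id": "cred-1"},
}


def net_amount(payment_id):
    """Payment amount minus all credits that reference the payment."""
    paid = payments[payment_id]["amount_cents"]
    credited = sum(
        c["amount_cents"] for c in credits.values() if c["payment_id"] == payment_id
    )
    return paid - credited
```

Had the credit been embedded inside the payment, the adjustment would need both the payment's _id and some application-defined key to locate it.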
To summarize, I have been successful with lots of small documents and many collections. I implement references with _id and only with _id. Thus, I don't worry about ever-growing documents destroying my application. The schema is easy to understand and index because each entity has its own collection. Important entities aren't hiding inside other documents.
I'd love to hear about your findings. Good luck!