Normalize or Denormalize in high traffic websites

Posted by 浪子不回头ぞ on 2020-01-22 05:30:50

Question


What are the best practices for database design and normalization for high-traffic websites like Stack Overflow?

Should one use a normalized database for record keeping, a denormalized technique, or a combination of both?

Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?

or

Should the main database be denormalized but with normalized views at the application level for fast database operations?

or some other approach?


Answer 1:


The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
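
To make that concrete, here is a minimal sketch (using Python's built-in sqlite3 and made-up users/orders tables) of how the plan for the very same join changes once a suitable index exists; the join itself is rarely the problem, the missing index is.

    # A minimal sketch (in-memory SQLite, hypothetical tables) showing how the
    # plan for the same join changes once a suitable index exists.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    """)

    query = ("SELECT u.name, o.total FROM orders o "
             "JOIN users u ON u.id = o.user_id WHERE o.user_id = ?")

    # Without an index on orders.user_id the engine must scan the orders table.
    print("before:", con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

    con.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")

    # With the index, the same join becomes an index lookup instead of a scan.
    print("after: ", con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())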

The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc. don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.

So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
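
As a rough illustration of what "at least third normal form" means for a transactional system, here is a hypothetical banking-style schema sketch (table and column names are invented): every non-key column depends only on its table's key, and integrity is enforced with keys and constraints rather than duplicated data.

    # A hypothetical transactional schema in (at least) third normal form.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE accounts (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            balance     REAL NOT NULL CHECK (balance >= 0)
        );
        CREATE TABLE transfers (
            id           INTEGER PRIMARY KEY,
            from_account INTEGER NOT NULL REFERENCES accounts(id),
            to_account   INTEGER NOT NULL REFERENCES accounts(id),
            amount       REAL NOT NULL CHECK (amount > 0)
        );
    """)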




Answer 2:


Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
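
A small sketch of that trade-off, with an invented question/answer schema: the denormalized summary table makes the read a single key lookup, but every insert now has to touch two tables.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE answers (id INTEGER PRIMARY KEY, question_id INTEGER, body TEXT);
        CREATE TABLE question_stats (question_id INTEGER PRIMARY KEY,
                                     answer_count INTEGER NOT NULL DEFAULT 0);
    """)

    def add_answer(question_id, body):
        # The extra write here is the price paid for the cheap read below.
        with con:
            con.execute("INSERT INTO answers (question_id, body) VALUES (?, ?)",
                        (question_id, body))
            con.execute("""
                INSERT INTO question_stats (question_id, answer_count) VALUES (?, 1)
                ON CONFLICT(question_id) DO UPDATE SET answer_count = answer_count + 1
            """, (question_id,))

    def answer_count(question_id):
        # Read path: a single primary-key lookup, no join, no COUNT(*).
        row = con.execute("SELECT answer_count FROM question_stats WHERE question_id = ?",
                          (question_id,)).fetchone()
        return row[0] if row else 0

    add_answer(1, "Normalize first, denormalize where measurements say so.")
    print(answer_count(1))  # -> 1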

That said, for ridiculous amounts of traffic, good relational db performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
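
A minimal cache-aside sketch of that idea; a plain dict stands in for memcached here, and load_from_db is a hypothetical placeholder for the expensive relational query being shielded.

    import time

    cache = {}
    TTL_SECONDS = 60

    def load_from_db(key):
        # Placeholder for the expensive relational query.
        return f"value-for-{key}"

    def get(key):
        hit = cache.get(key)
        if hit is not None:
            value, stored_at = hit
            if time.time() - stored_at < TTL_SECONDS:
                return value          # served from cache, the database is never touched
        value = load_from_db(key)     # cache miss: fall through to the database
        cache[key] = (value, time.time())
        return value

    print(get("user:42"))  # first call hits the "database"
    print(get("user:42"))  # second call is served from the cache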

The Art of Capacity Planning is pretty good.




Answer 3:


You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html




Answer 4:


First, define for yourself what high-traffic means:

  • 50,000 page views per day?
  • 500,000 page views per day?
  • 5,000,000 page views per day?
  • more?

Then calculate this down to probable peak page views per minute and per second. After that, think about the data you want to query per page view. Is the data cacheable? How dynamic is the data, and how big is it?
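
A back-of-the-envelope sketch of that calculation; the queries-per-page-view figure and the peak factor are assumptions you would replace with your own measurements.

    daily_page_views = 5_000_000
    queries_per_page_view = 10          # assumed average
    peak_factor = 3                     # assume the peak minute is ~3x the daily average

    avg_views_per_second = daily_page_views / 86_400
    peak_views_per_second = avg_views_per_second * peak_factor
    peak_queries_per_second = peak_views_per_second * queries_per_page_view

    print(f"average page views/s: {avg_views_per_second:.1f}")    # ~57.9
    print(f"peak page views/s:    {peak_views_per_second:.1f}")   # ~173.6
    print(f"peak db queries/s:    {peak_queries_per_second:.0f}") # ~1736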

Analyze your individual requirements, write some code, do some load testing, optimize. In most cases, before you need to scale out the database servers, you need to scale out the web servers.

A relational database can be, if fully optimized, amazingly fast when joining tables!

A relational database could be hit only seldom when used as a back-end, to populate a cache or to fill some denormalized data tables. I would not make denormalization the default approach.
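
One way that can look in practice, sketched with an invented schema: a periodic job rebuilds a denormalized read table from the normalized source, so the relational database is hit by the refresh job rather than by every page view.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES users(id),
                            title TEXT);

        -- denormalized read table that the website actually queries
        CREATE TABLE post_listing (post_id INTEGER PRIMARY KEY,
                                   title TEXT, author_name TEXT);
    """)

    def refresh_post_listing():
        # Run from a cron job or queue worker, not on every page view.
        with con:
            con.execute("DELETE FROM post_listing")
            con.execute("""
                INSERT INTO post_listing (post_id, title, author_name)
                SELECT p.id, p.title, u.name
                FROM posts p JOIN users u ON u.id = p.user_id
            """)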

(You mentioned search; look into e.g. Lucene or something similar if you need full-text search.)

The best best-practice answer is definitely: It depends ;-)




Answer 5:


For a project I'm working on, we've gone for the denormalized-table route, as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find it worth reading http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.
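
A sketch of the routing that kind of sharding implies (the shard connection strings are made up for illustration): each user id maps deterministically to one shard, so a given user's reads and writes stay on a single database.

    import hashlib

    SHARDS = [
        "postgresql://db-shard-0/app",
        "postgresql://db-shard-1/app",
        "postgresql://db-shard-2/app",
        "postgresql://db-shard-3/app",
    ]

    def shard_for_user(user_id: int) -> str:
        # A stable hash keeps a user on the same shard across requests and restarts.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for_user(12345))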




Answer 6:


Neither matters if you aren't caching properly.



Source: https://stackoverflow.com/questions/1217262/normalize-or-denormalize-in-high-traffic-websites
