Why are joins bad or \'slow\'. I know i heard this more then once. I found this quote
The problem is joins are relatively slow, especially over very
Well, yeah, selecting rows from one denormalized table (assuming decent indexes for your query) might be faster that selecting rows constructed from joining several tables, particularly if the joins don't have efficient indexes available.
The examples cited in the article - Flickr and eBay - are exceptional cases IMO, so have (and deserve) exceptional responses. The author specifically calls out the lack of RI and the extent of data duplication in the article.
Most applications - again, IMO - benefit from the validation & reduced duplication provided by RDBMSs.
Joins can be slower than avoiding them through de-normalisation but if used correctly (joining on columns with appropriate indexes an so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.
The amount of temporary data that is generated could be huge based on the joins.
For an example, one database here at work had a generic search function where all of the fields were optional. The search routine did a join on every table before the search began. This worked well in the beginning. But, now that the main table has over 10 million rows... not so much. Searches now take 30 minutes or more.
I was tasked with optimizing the search stored procedure.
The first thing I did was if any of the fields of the main table were being searched, I did a select to a temp table on those fields only. THEN, I joined all the tables with that temp table before doing the rest of the search. Searches where one of the main table fields now take less than 10 seconds.
If none of the main table fields are begin searched, I do similar optimizations for other tables. When I was done, no search takes longer than 30 seconds with most under 10.
CPU utilization of the SQL server also went WAY DOWN.
Joins are fast.
Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.
The caveat is that you must understand normalization, joining, and the proper use of indexes.
Beware premature optimization, as the number one failing of all development projects is meeting the deadline. Once you've completed the project, and you understand the trade offs, you can break the rules if you can justify it.
It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.
It's also true that a bird flies faster without any wings, but only straight down.
Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:
I would go as far as saying the main reason relational databases exist at all is to allow you do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.
Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? I the case of JOINs, I think its a reference to a large result set, not that overall dataset.
Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.
Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database with add delays.
Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".