Why are joins bad when considering scalability?

猫巷女王i 2020-12-12 11:32

Why are joins bad or 'slow'? I know I've heard this more than once. I found this quote:

The problem is joins are relatively slow, especially over very …

16 Answers
  • 2020-12-12 12:03

    Joins are considered an opposing force to scalability because they are typically the bottleneck and they cannot be easily distributed or parallelized.

  • 2020-12-12 12:04

    The article says that joins are slow compared to the absence of joins, which can be achieved through denormalization. So there is a trade-off between speed and normalization. Don't forget about premature optimization, too :)
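
    As a minimal sketch of what "absence of joins" looks like in practice (the orders/customers tables and the copied customer_name column are hypothetical):

    -- Normalized read: pays for a join at query time
    SELECT  o.order_id, c.customer_name
    FROM    orders o
    JOIN    customers c ON c.customer_id = o.customer_id;

    -- Denormalized read: customer_name has been copied into orders,
    -- so the join disappears but the copy must now be kept in sync
    SELECT  order_id, customer_name
    FROM    orders;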

  • 2020-12-12 12:05

    People with terabyte-sized databases still use joins; if they can get them to perform well, then so can you.

    There are many reasons not to denormalize. First, the speed of SELECT queries is not the only, or even the main, concern with databases. Integrity of the data is the first concern. If you denormalize, you have to put techniques in place to keep the data consistent as the parent data changes.

    Suppose you take to storing the client name in every table instead of joining to the client table on the client_Id. Now, when the name of a client changes (there is a 100% chance some client names will change over time), you need to update all the child records to reflect that change. If you do this with a cascade update and you have a million child records, how fast do you suppose that is going to be, and how many users are going to suffer locking issues and delays in their work while it happens? Furthermore, most people who denormalize because "joins are slow" don't know enough about databases to make sure their data integrity is protected, and they often end up with databases whose data is unusable because the integrity is so bad.
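
    As a rough sketch of that maintenance burden (the orders and invoices child tables are hypothetical, and both carry a copied client_name column):

    -- Normalized: the name lives only in the clients table, one row changes
    UPDATE clients SET client_name = 'Acme Ltd' WHERE client_id = 42;

    -- Denormalized: every table that copied the name must be touched as well,
    -- potentially locking millions of child rows while it runs
    UPDATE orders   SET client_name = 'Acme Ltd' WHERE client_id = 42;
    UPDATE invoices SET client_name = 'Acme Ltd' WHERE client_id = 42;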

    Denormalization is a complex process that requires a thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have that expertise on staff.

    Joins are quite fast enough if you do several things. First, use a surrogate key; an int join is almost always the fastest join. Second, always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large, very complex database, then hire a professional database person with experience in partitioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
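
    A minimal sketch of the first two points, using hypothetical clients/orders tables: an integer surrogate key on the parent and an explicit index on the child's foreign key.

    -- Integer surrogate key on the parent table
    CREATE TABLE clients (
        client_id   INT PRIMARY KEY,
        client_name VARCHAR(100) NOT NULL
    );

    -- Child table referencing the surrogate key
    CREATE TABLE orders (
        order_id  INT PRIMARY KEY,
        client_id INT NOT NULL REFERENCES clients (client_id),
        total     DECIMAL(10, 2)
    );

    -- Index the foreign key so the join can seek instead of scan
    CREATE INDEX ix_orders_client_id ON orders (client_id);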

    If you just need query capability, then yes, you can design a data warehouse, which can be denormalized and is populated through an ETL tool (optimized for speed) rather than by user data entry.

  • 2020-12-12 12:11

    Also from the article you cited:

    What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme, and some are even advocating denormalization as the best strategy for architecting the data tier.

    and

    And unless you are a really large website you probably don't need to worry about this level of complexity.

    and

    It's more error prone than having the database do all this work, but you are able to scale past what even the highest end databases can handle.

    The article is discussing mega-sites like eBay. At that level of usage you are likely going to have to consider something other than a plain vanilla relational database management system. But in the "normal" course of business (applications with thousands of users and millions of records), those more expensive, more error-prone approaches are overkill.

  • 2020-12-12 12:12

    Joins can be slow if large portions of the records from each side need to be scanned.

    Like this:

    SELECT  SUM(transaction)
    FROM    customers
    JOIN    accounts
    ON      account_customer = customer_id
    

    Even if an index is defined on account_customer, all records from accounts still need to be scanned.

    For a query like this, a decent optimizer probably won't even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.
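
    If you want to see which strategy the optimizer actually picked, you can ask for the plan; as a sketch, databases such as PostgreSQL and MySQL accept an EXPLAIN prefix (other engines have their own equivalents):

    -- For the unfiltered aggregate you will typically see a hash or merge
    -- join over full scans of both tables, not an index lookup
    EXPLAIN
    SELECT  SUM(transaction)
    FROM    customers
    JOIN    accounts
    ON      account_customer = customer_id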

    Note that for a query like this:

    SELECT  SUM(transaction)
    FROM    customers
    JOIN    accounts
    ON      account_customer = customer_id
    WHERE   customer_last_name = 'Stellphlug'
    

    the join will most probably be fast: first, an index on customer_last_name will be used to find all the Stellphlugs (who are, of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.

    Even though there can be billions of records in accounts and customers, only a few will actually need to be scanned.
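
    For reference, these are the indexes that plan assumes (the index names are illustrative):

    -- Index used to find the matching customers by last name
    CREATE INDEX ix_customers_last_name ON customers (customer_last_name);

    -- Index used to look up each matching customer's accounts
    CREATE INDEX ix_accounts_customer ON accounts (account_customer);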

  • 2020-12-12 12:13

    While joins (presumably due to a normalized design) can obviously be slower for data retrieval than a read from a single table, a denormalized database can be slow for data creation/update operations since the footprint of the overall transaction will not be minimal.

    In a normalized database, a piece of data lives in only one place, so the footprint of an update is as small as possible. In a denormalized database, the same value may have to be updated in multiple rows or across several tables, meaning the footprint is larger and the chance of locks and deadlocks increases.
