Joining 100 tables

慢半拍i 2020-12-15 10:10

Assume that I have a main table which has 100 columns referencing (as foreign keys) some 100 tables (containing primary keys).

The whole pack of information requi…

3 Answers
  • 2020-12-15 10:33

    Why do you think joining 100 tables would be a performance issue?

    If all the keys are primary keys, then all the joins will use indexes. The only question, then, is whether the indexes fit into memory. If they fit in memory, performance is probably not an issue at all.

    You should try the query with the 100 joins before making such a statement.
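
    A quick way to do that is to run the query once with I/O and timing statistics switched on and look at the logical reads and elapsed time. A minimal sketch (dbo.BigView is a placeholder name for whatever view or query wraps the 100 joins):

    -- dbo.BigView is hypothetical; substitute your own 100-join view or query.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT TOP (1000) *       -- or just the columns you actually need
    FROM dbo.BigView;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;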

    Furthermore, based on the original question, the reference tables have just a few values in them. Each such table fits on a single page, plus another page for its index. That is 200 pages in total, which would occupy at most a few megabytes of your page cache. Don't worry about the optimizations; create the view, and if you have performance problems, then think about the next steps. Don't presuppose performance problems.
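
    If you want to check the "fits into memory" assumption rather than guess, the buffer-pool DMVs show how many pages of each object are currently cached. A rough sketch, assuming SQL Server 2008 or later and the standard 8 KB page size:

    -- Approximate cached size per table in the current database.
    SELECT
        OBJECT_NAME(p.object_id)  AS table_name,
        COUNT(*)                  AS cached_pages,
        COUNT(*) * 8 / 1024.0     AS cached_mb
    FROM sys.dm_os_buffer_descriptors AS bd
    JOIN sys.allocation_units AS au
        ON au.allocation_unit_id = bd.allocation_unit_id
    JOIN sys.partitions AS p
        ON p.hobt_id = au.container_id
       AND au.type IN (1, 3)      -- IN_ROW_DATA, ROW_OVERFLOW_DATA
    WHERE bd.database_id = DB_ID()
    GROUP BY p.object_id
    ORDER BY cached_pages DESC;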

    ELABORATION:

    This has received a lot of comments. Let me explain why this idea may not be as crazy as it sounds.

    First, I am assuming that all the joins are done through primary key indexes, and that the indexes fit into memory.

    The 100 keys on the page occupy 400 bytes. Let's say that the original strings are, on average, 40 bytes each. These would have occupied 4,000 bytes on the page, so we have a savings. In fact, about 2 records would fit on a page in the previous scheme, while about 20 fit on a page with the keys.

    So, to read the records with the keys is about 10 times faster in terms of I/O than reading the original records. With the assumptions about the small number of values, the indexes and original data fit into memory.

    How long does it take to read 20 records? The old way required reading 10 pages. With the keys, there is one page read and 100*20 index lookups (with perhaps an additional lookup to get the value). Depending on the system, the 2,000 index lookups may be faster -- even much faster -- than the additional 9 page I/Os. The point I want to make is that this is a reasonable situation. It may or may not happen on a particular system, but it is not way crazy.

    This is a bit oversimplified. SQL Server doesn't actually read pages one at a time; I think they are read in groups of 4 (and there may be read-ahead reads when doing a full table scan). On the flip side, though, in most cases a table-scan query is going to be more I/O bound than processor bound, so there are spare processor cycles for looking up values in the reference tables.

    In fact, using the keys could result in faster reading of the table than not using them, because spare processing cycles would be used for the lookups ("spare" in the sense that processing power is available when reading). In fact, the table with the keys might be small enough to fit into available cache, greatly improving performance of more complex queries.

    The actual performance depends on lots of factors, such as the length of the strings, the original table (is it larger than available cache?), the ability of the underlying hardware to do I/O reads and processing at the same time, and the dependence on the query optimizer to do the joins correctly.

    My original point was that assuming a priori that the 100 joins are a bad thing is not correct. The assumption needs to be tested, and using the keys might even give a boost to performance.

  • 2020-12-15 10:42

    The SQL Server optimizer does contain logic to remove redundant joins, but there are restrictions, and the joins have to be provably redundant. To summarize, a join can have four effects (a minimal example of a removable join follows the list):

    1. It can add extra columns (from the joined table)
    2. It can add extra rows (the joined table may match a source row more than once)
    3. It can remove rows (the joined table may not have a match)
    4. It can introduce NULLs (for a RIGHT or FULL JOIN)
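
    As a minimal illustration (hypothetical dbo.Parent and dbo.Child tables, not from the question), an inner join on a trusted foreign key from a NOT NULL column, where the query selects nothing from the joined table, has none of these four effects and can be removed from the plan:

    -- c.ParentId is NOT NULL and has a trusted FOREIGN KEY to dbo.Parent(pk);
    -- no Parent columns are selected, so the join is provably redundant and
    -- the optimizer drops it from the execution plan.
    SELECT c.SomeColumn
    FROM dbo.Child AS c
    JOIN dbo.Parent AS p
        ON p.pk = c.ParentId;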

    To successfully remove a redundant join, the query (or view) must account for all four possibilities. When this is done correctly, the effect can be astonishing. For example:

    USE AdventureWorks2012;
    GO
    CREATE VIEW dbo.ComplexView
    AS
        SELECT
            pc.ProductCategoryID, pc.Name AS CatName,
            ps.ProductSubcategoryID, ps.Name AS SubCatName,
            p.ProductID, p.Name AS ProductName,
            p.Color, p.ListPrice, p.ReorderPoint,
            pm.Name AS ModelName, pm.ModifiedDate
        FROM Production.ProductCategory AS pc
        FULL JOIN Production.ProductSubcategory AS ps ON
            ps.ProductCategoryID = pc.ProductCategoryID
        FULL JOIN Production.Product AS p ON
            p.ProductSubcategoryID = ps.ProductSubcategoryID
        FULL JOIN Production.ProductModel AS pm ON
            pm.ProductModelID = p.ProductModelID
    

    The optimizer can successfully simplify the following query:

    SELECT
        c.ProductID,
        c.ProductName
    FROM dbo.ComplexView AS c
    WHERE
        c.ProductName LIKE N'G%';
    

    To:

    Simplified plan

    Rob Farley wrote about these ideas in depth in the original MVP Deep Dives book, and there is a recording of him presenting on the topic at SQLBits.

    The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process, and that compilation time for queries against such a view may become quite long, particularly as the number of joins increases. It could be quite a challenge to write a 100-table view that gets all the semantics exactly correct. I would be inclined to find an alternative solution, perhaps using dynamic SQL.
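
    The dynamic SQL idea, roughly, is to generate a statement that joins only the reference tables a particular request actually needs, instead of routing everything through one 100-join view. A sketch using the ten-table example further down (in practice the join list would be built from metadata rather than hard-coded):

    -- Sketch only: generate just the joins the caller asked for.
    DECLARE @sql nvarchar(max) = N'
    SELECT n.pk, r01.item AS item01, r08.item AS item08
    FROM dbo.Normalized AS n
    JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
    JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
    WHERE r08.item = @filter;';

    EXEC sys.sp_executesql
        @sql,
        N'@filter varchar(50)',
        @filter = 'Banana';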

    That said, the particular qualities of your denormalized table may mean the view is quite simple to assemble, requiring only enforced FOREIGN KEYs, non-NULLable referenced columns, and appropriate UNIQUE constraints for this solution to work as you would hope, without the overhead of 100 physical join operators in the plan.

    Example

    Using ten tables rather than a hundred:

    -- Referenced tables
    CREATE TABLE dbo.Ref01 (col01 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref02 (col02 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref03 (col03 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref04 (col04 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref05 (col05 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref06 (col06 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref07 (col07 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref08 (col08 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref09 (col09 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    CREATE TABLE dbo.Ref10 (col10 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
    

    The parent table definition (with page compression):

    CREATE TABLE dbo.Normalized
    (
        pk      integer IDENTITY NOT NULL,
        col01   tinyint NOT NULL REFERENCES dbo.Ref01,
        col02   tinyint NOT NULL REFERENCES dbo.Ref02,
        col03   tinyint NOT NULL REFERENCES dbo.Ref03,
        col04   tinyint NOT NULL REFERENCES dbo.Ref04,
        col05   tinyint NOT NULL REFERENCES dbo.Ref05,
        col06   tinyint NOT NULL REFERENCES dbo.Ref06,
        col07   tinyint NOT NULL REFERENCES dbo.Ref07,
        col08   tinyint NOT NULL REFERENCES dbo.Ref08,
        col09   tinyint NOT NULL REFERENCES dbo.Ref09,
        col10   tinyint NOT NULL REFERENCES dbo.Ref10,
    
        CONSTRAINT PK_Normalized
            PRIMARY KEY CLUSTERED (pk)
            WITH (DATA_COMPRESSION = PAGE)
    );
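
    For the join removal to happen, the foreign keys must be trusted (created or re-enabled WITH CHECK, as the inline REFERENCES constraints above are by default). A quick sketch to verify that none of them ended up untrusted:

    -- Lists any foreign keys on dbo.Normalized that the optimizer cannot rely on.
    SELECT fk.name, fk.is_not_trusted, fk.is_disabled
    FROM sys.foreign_keys AS fk
    WHERE fk.parent_object_id = OBJECT_ID(N'dbo.Normalized')
      AND (fk.is_not_trusted = 1 OR fk.is_disabled = 1);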
    

    The view:

    CREATE VIEW dbo.Denormalized
    WITH SCHEMABINDING AS
    SELECT
        item01 = r01.item,
        item02 = r02.item,
        item03 = r03.item,
        item04 = r04.item,
        item05 = r05.item,
        item06 = r06.item,
        item07 = r07.item,
        item08 = r08.item,
        item09 = r09.item,
        item10 = r10.item
    FROM dbo.Normalized AS n
    JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
    JOIN dbo.Ref02 AS r02 ON r02.col02 = n.col02
    JOIN dbo.Ref03 AS r03 ON r03.col03 = n.col03
    JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04
    JOIN dbo.Ref05 AS r05 ON r05.col05 = n.col05
    JOIN dbo.Ref06 AS r06 ON r06.col06 = n.col06
    JOIN dbo.Ref07 AS r07 ON r07.col07 = n.col07
    JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
    JOIN dbo.Ref09 AS r09 ON r09.col09 = n.col09
    JOIN dbo.Ref10 AS r10 ON r10.col10 = n.col10;
    

    Hack the statistics to make the optimizer think the table is very large:

    UPDATE STATISTICS dbo.Normalized WITH ROWCOUNT = 100000000, PAGECOUNT = 5000000;
    

    Example user query:

    SELECT
        d.item06,
        d.item07
    FROM dbo.Denormalized AS d
    WHERE
        d.item08 = 'Banana'
        AND d.item01 = 'Green';
    

    Gives us this execution plan:

    Execution plan 1

    The scan of the Normalized table looks bad, but both bitmap filters are applied during the scan by the storage engine (so rows that cannot match do not even surface as far as the query processor). This may be enough to give acceptable performance in your case, and certainly better than scanning the original table with its overflowing columns.
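
    If you want to confirm on your own system which of the ten joins survive in the plan and where the bitmaps are applied, you can capture the plan XML for the query (or just inspect the graphical plan). A minimal sketch:

    -- SET SHOWPLAN_XML must be the only statement in its batch, hence the GO separators.
    SET SHOWPLAN_XML ON;
    GO
    SELECT d.item06, d.item07
    FROM dbo.Denormalized AS d
    WHERE d.item08 = 'Banana'
        AND d.item01 = 'Green';
    GO
    SET SHOWPLAN_XML OFF;
    GO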

    If you are able to upgrade to SQL Server 2012 Enterprise at some stage, you have another option: creating a nonclustered columnstore index on the Normalized table:

    CREATE NONCLUSTERED COLUMNSTORE INDEX cs 
    ON dbo.Normalized (col01,col02,col03,col04,col05,col06,col07,col08,col09,col10);
    

    The execution plan is:

    Columnstore Plan

    That probably looks worse to you, but column storage provides exceptional compression, and the whole execution plan runs in Batch Mode with filters for all the contributing columns. If the server has adequate threads and memory available, this alternative could really fly.

    Ultimately, I'm not sure this normalization is the correct approach, considering the number of tables and the chances of getting a poor execution plan or requiring excessive compilation time. I would probably correct the schema of the denormalized table first (proper data types and so on), possibly apply data compression... the usual things.

    If the data truly belongs in a star-schema, it probably needs more design work than just splitting off repeating data elements into separate tables.

  • 2020-12-15 10:52

    If your data doesn't change much, you may benefit from creating an indexed view, which essentially materializes the view's result set.

    If the data changes often, it may not be a good option, as the server has to maintain the indexed view for each change in the underlying tables of the view.

    Here's a good blog post that describes it a bit better.

    From the blog:

    CREATE VIEW dbo.vw_SalesByProduct_Indexed
     WITH SCHEMABINDING
     AS
          SELECT 
                Product, 
                COUNT_BIG(*) AS ProductCount, 
                SUM(ISNULL(SalePrice,0)) AS TotalSales
          FROM dbo.SalesHistory
          GROUP BY Product
     GO
    

    The script below creates the index on our view:

    CREATE UNIQUE CLUSTERED INDEX idx_SalesView ON vw_SalesByProduct_Indexed(Product)
    

    To show that an index has been created on the view and that it does take up space in the database, run the following script to find out how many rows are in the clustered index and how much space the view takes up.

    EXECUTE sp_spaceused 'vw_SalesByProduct_Indexed'
    

    The SELECT statement below is the same statement as before, except this time it performs a clustered index seek, which is typically very fast.

    SELECT 
          Product, TotalSales, ProductCount 
     FROM vw_SalesByProduct_Indexed
     WHERE Product = 'Computer'
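
    One caveat worth adding (not from the blog post): automatic matching of queries to an indexed view is an Enterprise edition optimizer feature, so on other editions you reference the view with the NOEXPAND hint to get that clustered index seek:

    -- WITH (NOEXPAND) forces the optimizer to use the view's own index
    -- instead of expanding the view to its base tables.
    SELECT
          Product, TotalSales, ProductCount
     FROM dbo.vw_SalesByProduct_Indexed WITH (NOEXPAND)
     WHERE Product = 'Computer';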
    