How to load data faster with Talend and SQL Server

傲寒 2020-12-29 14:54

I use Talend to load data into a SQL Server database.

It appears that the weakest point of my job is not the data processing, but the actual load into my database, w

8 answers
  • 2020-12-29 15:36

    I think that @ydaetskcoR's answer is perfect from a theoretical point of view (separate the rows that need an INSERT from those that need an UPDATE) and gives you a working ETL solution that is useful for small datasets (a few thousand rows).

    Performing the lookup to decide whether a row has to be updated or not is costly in ETL, as all the data travels back and forth between the Talend machine and the DB server.

    When you get to hundreds of thousands or even millions of records, you have to move from ETL to ELT: you just load your data into a temp (staging) table, as suggested by @Balazs Gunics, and then use SQL to manipulate it.

    In this case, after loading your data (INSERT only, which is fast, and even faster with the BULK LOAD components), you issue a LEFT OUTER JOIN between the temp table and the destination table to separate the rows that are already there (and need an update) from the others.

    This query will give you the rows you need to insert:

    SELECT staging.* FROM staging
    LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
    WHERE destination.PK IS NULL
    

    And this one will give you the rows you need to update:

    SELECT staging.* FROM staging
    LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
    WHERE destination.PK IS NOT NULL
    
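To see the two queries side by side, here is a minimal sketch that runs them against an in-memory SQLite database. SQLite is only a stand-in for SQL Server here, and the `name` column and the sample rows are assumptions made for illustration; the two `SELECT` statements are the ones from this answer, with an `ORDER BY` added so the output is stable.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; the schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE destination (PK INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging     (PK INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO destination VALUES (1, 'old-a'), (2, 'old-b');
    INSERT INTO staging     VALUES (2, 'new-b'), (3, 'new-c');
""")


def rows_to_insert(conn):
    """Staging rows with no match in destination (the join leaves PK NULL)."""
    return conn.execute("""
        SELECT staging.* FROM staging
        LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
        WHERE destination.PK IS NULL
        ORDER BY staging.PK
    """).fetchall()


def rows_to_update(conn):
    """Staging rows whose key already exists in destination."""
    return conn.execute("""
        SELECT staging.* FROM staging
        LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
        WHERE destination.PK IS NOT NULL
        ORDER BY staging.PK
    """).fetchall()


print(rows_to_insert(conn))  # [(3, 'new-c')]
print(rows_to_update(conn))  # [(2, 'new-b')]
```

Key 3 exists only in staging, so it lands in the insert set; key 2 exists in both tables, so it lands in the update set.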

    This will be orders of magnitude faster than ETL, but you will need to use SQL to operate on your data, whereas in ETL you can use Java, since all the data is brought to the Talend server. A common approach is therefore a first step on the local machine that pre-processes the data in Java (to clean and validate it), followed by a load into the DB, where you use joins to write the data to the right place.
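The two-phase flow described above (clean locally, then let the database do the insert/update split) could look roughly like the sketch below. It is written in Python with SQLite as a stand-in, not as a Talend job, and the `clean` and `load` helpers, the `name` column, and the sample rows are all hypothetical. The `UPDATE` uses a correlated subquery for portability; on SQL Server you would more likely write `UPDATE ... FROM` with a join, or a `MERGE`.

```python
import sqlite3


def clean(rows):
    """Local pre-processing: drop rows without a key, normalise the rest."""
    cleaned = []
    for pk, name in rows:
        if pk is None:
            continue  # a row without a key cannot be matched against destination
        cleaned.append((int(pk), name.strip()))
    return cleaned


def load(conn, rows):
    """Insert-only load into staging, then apply it to destination in SQL."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM staging")
        conn.executemany(
            "INSERT INTO staging (PK, name) VALUES (?, ?)", clean(rows)
        )
        # Rows already present in destination get updated...
        conn.execute("""
            UPDATE destination
            SET name = (SELECT s.name FROM staging s WHERE s.PK = destination.PK)
            WHERE PK IN (SELECT PK FROM staging)
        """)
        # ...and the remaining staging rows get inserted.
        conn.execute("""
            INSERT INTO destination (PK, name)
            SELECT staging.PK, staging.name FROM staging
            LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
            WHERE destination.PK IS NULL
        """)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE destination (PK INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE staging (PK INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO destination VALUES (1, 'old-a'), (2, 'old-b')")
conn.commit()

load(conn, [("2", " new-b "), (None, "orphan"), (3, "new-c")])
result = conn.execute("SELECT PK, name FROM destination ORDER BY PK").fetchall()
print(result)  # [(1, 'old-a'), (2, 'new-b'), (3, 'new-c')]
```

Only the cleaned rows ever cross the network, and they cross it once; the lookup that decides between insert and update happens entirely inside the database.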

    Here are the ELT job screenshots.

    [Screenshot: INSERT or UPDATE ELT job]

    [Screenshot: How to distinguish between rows to insert or update]

  • 2020-12-29 15:36

    You should create a staging table and insert the rows into it.

    Based on this staging table, you run a DELETE query with a t*SQLrow component.

    DELETE FROM target_table
    WHERE target_table.id IN (SELECT id FROM staging_table);
    

    So the rows you wanted to update no longer exist in the target table.

    INSERT INTO target_table 
    SELECT * FROM staging_table;
    

    This will copy all the new/modified rows into the target table.
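As a rough illustration, the delete-then-insert pair can be run as one unit. The sketch below uses Python with an in-memory SQLite database as a stand-in for SQL Server; the `name` column, the sample rows, and the `replace_from_staging` helper are assumptions. Wrapping both statements in a single transaction is my addition, not part of the answer: it avoids a window in which the rows to be updated are missing from the target table.

```python
import sqlite3


def replace_from_staging(conn):
    """Delete the rows that staging will replace, then copy staging over."""
    with conn:  # both statements commit together or not at all
        conn.execute("""
            DELETE FROM target_table
            WHERE target_table.id IN (SELECT id FROM staging_table)
        """)
        conn.execute("""
            INSERT INTO target_table
            SELECT * FROM staging_table
        """)


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_table  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging_table (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target_table  VALUES (1, 'old-a'), (2, 'old-b');
    INSERT INTO staging_table VALUES (2, 'new-b'), (3, 'new-c');
""")

replace_from_staging(conn)
result = conn.execute("SELECT id, name FROM target_table ORDER BY id").fetchall()
print(result)  # [(1, 'old-a'), (2, 'new-b'), (3, 'new-c')]
```

Note that `INSERT INTO ... SELECT *` relies on both tables having the same columns in the same order; listing the columns explicitly is safer if the schemas can drift.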
