how to load data faster with talend and sql server

前端 未结 8 1736
傲寒
傲寒 2020-12-29 14:54

I use Talend to load data into a sql-server database.

It appears that the weakest point of my job is not the dataprocessing, but the effective load in my database, w

8条回答
  •  星月不相逢
    2020-12-29 15:31

    Database INSERT OR UPDATE methods are incredibly costly as the database cannot batch all of the commits to do all at once and must do them line by line (ACID transactions force this because if it attempted to do an insert and then failed then all of the other records in this commit would also fail).

    Instead, for large bulk operations it is always best to predetermine whether a record would be inserted or updated before passing the commit to the database and then sending 2 transactions to the database.

    A typical job that needed this functionality would assemble the data that is to be INSERT OR UPDATEd and then query the database table for the existing primary keys. If the primary key already exists then you can send this as an UPDATE, otherwise it is an INSERT. The logic for this can be easily done in a tMap component.

    Insert or Update Job Example

    In this job we have some data that we wish to INSERT OR UPDATE into a database table that contains some pre-existing data:

    Initially loaded data

    And we wish to add the following data to it:

    Insert or Update data

    The job works by throwing the new data into a tHashOutput component so it can be used multiple times in the same job (it simply puts it to memory or in large instances can cache it to the disk).

    Following on from this one lot of data is read out of a tHashInput component and directly into a tMap. Another tHashInput component is utilised to run a parameterised query against the table:

    Parameterised Query Parameter Config

    You may find this guide to Talend and parameterised queries useful. From here the returned records (so only the ones inside the database already) are used as a lookup to the tMap.

    This is then configured as an INNER JOIN to find the records that need to be UPDATED with the rejects from the INNER JOIN to be inserted:

    tMap configuration

    These outputs then just flow to separate tMySQLOutput components to UPDATE or INSERT as necessary. And finally when the main subjob is complete we commit the changes.

提交回复
热议问题