How to load data faster with Talend and SQL Server

傲寒

I use Talend to load data into a SQL Server database.

It appears that the weakest point of my job is not the data processing but the actual load into my database, w

8 Answers

忘掉有多难

2020-12-29 15:25

I've found where this performance problem comes from.

I'm doing an INSERT OR UPDATE; if I replace it with a simple INSERT, the speed goes up to 4000 rows/s.

Does that seem like an acceptable pace?

Anyway, I need my INSERT OR UPDATE, so I guess I'm stuck.

广开言路

2020-12-29 15:26

I was having the same issue loading data into a DB2 server. I also had the commit size set to 10000, but once I selected the batch option (on the same component options screen), performance improved dramatically. When I raised both the commit and batch sizes to 20000, the job went from 5 hours to under 2 minutes.
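
To illustrate the batching idea outside Talend, here is a minimal sketch that sends rows in batches and commits once per batch rather than once per row. It uses Python's built-in sqlite3 module purely as a stand-in for the real database (the answer above used DB2 through Talend), and the table `target`, the function `load_in_batches` and the data are hypothetical names made up for the example.

```python
import sqlite3
from itertools import islice


def load_in_batches(conn, rows, batch_size=20000):
    """Insert rows in batches, committing once per batch.

    Returns the number of batches sent. sqlite3 is only a stand-in here
    (an assumption); the table and column names are hypothetical.
    """
    it = iter(rows)
    batches = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # One executemany call per batch instead of one execute per row.
        conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", batch)
        conn.commit()  # one commit per batch
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
rows = ((i, f"name-{i}") for i in range(50000))
n_batches = load_in_batches(conn, rows, batch_size=20000)
row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

With 50000 rows and a batch size of 20000, this sends three batches. The best batch and commit sizes depend on the database and the row width, so it is worth trying a few values as the answer did.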

执笔经年

2020-12-29 15:27

I had the same problem and solved it by defining an index on the target table.

Usually, the target table has an id field which is its primary key and hence indexed, so any join on it works just fine. But an update from a flat file matches on other data fields, so each UPDATE statement has to do a full table scan.

This also explains why the load is fast with INSERT and becomes slow with INSERT OR UPDATE.

说谎

2020-12-29 15:30

Based on your note that inserts are orders of magnitude faster than updates (4000 vs 17 rows/sec), it looks like you need to check your DB indexes. Adding an index that matches your update parameters could speed up your updates significantly. Of course, this index may slow your inserts a bit.

You can also look at the query execution plan for your update query to see whether it is using any indexes (see "How do I obtain a Query Execution Plan?").
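
As a small illustration of checking the plan before and after adding an index, the sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption; on SQL Server you would create the index with CREATE INDEX and inspect the execution plan with its own tooling). The table `customer`, the column `email` and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (email, city) VALUES (?, ?)",
    [(f"user{i}@example.com", "Paris") for i in range(1000)],
)

# The UPDATE filters on a data field, not on the primary key.
update_sql = "UPDATE customer SET city = 'Lyon' WHERE email = 'user42@example.com'"


def plan_for(sql):
    """Return the plan text SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


plan_before = plan_for(update_sql)  # no usable index yet: full table scan

# Index on the column the UPDATE filters on.
conn.execute("CREATE INDEX idx_customer_email ON customer (email)")

plan_after = plan_for(update_sql)   # the plan should now name the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of the table and the second names `idx_customer_email`. That is the same change you would hope to see in the real database's plan.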

[愿得一人]

2020-12-29 15:30
I'd recommend a few simple things:
1. Where possible / reasonable, change from ETL to ELT.
2. Set up a change data capture (CDC) process and only handle the changes. Depending on the database and needs, this can be handled (a) directly on the database, (b) through automated Talend functionality (needs a subscription), (c) manually via SQL (full outer join) and a custom Java function that generates an MD5 hash, or (d) manually via SQL (full outer join) and the tAddCRCRow component.
3. Where possible, load multiple tables concurrently.
4. Where possible, use bulk loading for tables.
5. Sometimes, a clear and load is acceptable as an approach and faster than checking for updates.
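
The hash-based change detection in point 2 (c) can be sketched roughly as follows. This is a simplified Python illustration, not the custom Java function the answer refers to: a dictionary lookup stands in for the SQL full outer join, and the names `row_hash`, `classify` and the sample rows are hypothetical.

```python
import hashlib


def row_hash(row, columns):
    """MD5 over the tracked columns, joined with a separator unlikely to occur in data."""
    joined = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def classify(source_rows, target_hashes, key, columns):
    """Compare source rows with the target's stored hashes.

    target_hashes maps primary key -> hash of the row currently in the target.
    Returns (rows to insert, rows to update, keys missing from the source).
    Unchanged rows are skipped entirely.
    """
    inserts, updates, seen = [], [], set()
    for row in source_rows:
        seen.add(row[key])
        old = target_hashes.get(row[key])
        if old is None:
            inserts.append(row)            # key not in target: new row
        elif old != row_hash(row, columns):
            updates.append(row)            # key exists but content differs
    deleted_keys = [k for k in target_hashes if k not in seen]
    return inserts, updates, deleted_keys


columns = ["name", "city"]
existing = [
    {"id": 1, "name": "Ann", "city": "Paris"},
    {"id": 2, "name": "Bob", "city": "Lyon"},
    {"id": 4, "name": "Dan", "city": "Metz"},
]
target_hashes = {r["id"]: row_hash(r, columns) for r in existing}

incoming = [
    {"id": 1, "name": "Ann", "city": "Paris"},  # unchanged
    {"id": 2, "name": "Bob", "city": "Nice"},   # changed
    {"id": 3, "name": "Cid", "city": "Lille"},  # new
]
inserts, updates, deleted_keys = classify(incoming, target_hashes, "id", columns)
```

Only the changed and new rows are sent to the database, which is where the saving comes from when most of the source data is unchanged between runs.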

星月不相逢

2020-12-29 15:31

Database INSERT OR UPDATE operations are very costly because the database cannot batch the statements and run them all at once; it must process them row by row (ACID transactions force this, because if an insert were attempted and failed, all of the other records in the same commit would fail too).

Instead, for large bulk operations it is usually better to determine in advance whether each record should be inserted or updated, and then send the database two separate operations: one batch of INSERTs and one batch of UPDATEs.

A typical job that needs this functionality assembles the data to be inserted or updated and then queries the database table for the existing primary keys. If the primary key already exists, the record is sent as an UPDATE; otherwise it is an INSERT. This logic is easy to implement in a tMap component.

In this job we have some data that we wish to INSERT OR UPDATE into a database table that contains some pre-existing data:

And we wish to add the following data to it:

The job works by writing the new data to a tHashOutput component so it can be used multiple times in the same job (it simply holds the data in memory or, for large volumes, can cache it to disk).

Following on from this one lot of data is read out of a tHashInput component and directly into a tMap. Another tHashInput component is utilised to run a parameterised query against the table:

You may find this guide to Talend and parameterised queries useful. From here the returned records (so only the ones inside the database already) are used as a lookup to the tMap.

This is then configured as an INNER JOIN to find the records that need to be UPDATED with the rejects from the INNER JOIN to be inserted:

These outputs then just flow to separate tMySQLOutput components to UPDATE or INSERT as necessary. And finally when the main subjob is complete we commit the changes.
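
For readers without Talend at hand, the same flow can be sketched in plain code: look up which incoming keys already exist, split the rows into updates and inserts, send each set as one batch, and commit once at the end. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in database (an assumption; the job above writes through tMySQLOutput), with a hypothetical `person` table and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.commit()

# New data to merge in: id 2 exists already, id 3 does not.
incoming = [(2, "Robert"), (3, "Cid")]

# Parameterised lookup: fetch only the keys that are present in the new data.
keys = [row[0] for row in incoming]
placeholders = ",".join("?" for _ in keys)
existing = {
    r[0]
    for r in conn.execute(
        f"SELECT id FROM person WHERE id IN ({placeholders})", keys
    )
}

# Inner-join matches become updates; the join rejects become inserts.
to_update = [(name, id_) for id_, name in incoming if id_ in existing]
to_insert = [(id_, name) for id_, name in incoming if id_ not in existing]

conn.executemany("UPDATE person SET name = ? WHERE id = ?", to_update)
conn.executemany("INSERT INTO person (id, name) VALUES (?, ?)", to_insert)
conn.commit()  # single commit once both batches have been sent

result = conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()
```

For very large key lists the IN (...) lookup would need to be chunked or replaced by a join against a staging table, since databases limit the number of bound parameters per statement.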
