Pentaho Data Integration: import large dataset from DB

Submitted by 余生颓废 on 2021-02-10 20:30:26

Question


I'm trying to import a large set of data from one DB to another (MSSQL to MySQL). The transformation does this: it gets a subset of data, checks whether each row is an update or an insert by comparing hashes, maps the data, and inserts it into the MySQL DB with an API call. The subset part is strictly manual at the moment. Is there a way to have Pentaho do it for me, some kind of iteration? The query I'm using to get the subset is:

select t1.* 
from (
    select *, ROW_NUMBER() over (order by id) as RowNum
    from mytable
) t1 
where RowNum between @offset and @offset + @limit;

Is there a way for PDI to set the offset and iterate through the whole table?

Thanks


Answer 1:


You can (despite the warnings) create a loop in a parent job, incrementing the offset variable on each iteration in a JavaScript step. I've used such a setup to consume web services with an unknown number of results, shifting the offset each time after getting a full page and stopping when I get less than a full page.

Setting up the variables

In the job properties, define the parameters Offset and Limit, so you can (re)start at any offset or even invoke the job from the command line with a specific offset and limit. It can be done with a Set Variables step too, but parameters do all the same things, plus you can set defaults for testing.
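For example, a sketch of launching the job from the command line with Kitchen, passing explicit values for the two parameters (the job file path and values here are hypothetical):

# hypothetical job path; -param:NAME=VALUE sets a named parameter defined in the job
./kitchen.sh -file=/path/to/paged_load.kjb -param:Offset=0 -param:Limit=50000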

Processing in the transformation

The main transformation(s) should have "pass parameter values to subtransformation" enabled, as it is by default.

Inside the transformation, start with a Table Input step that uses variable substitution, putting ${Offset} and ${Limit} where you had @offset and @limit.
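With variable substitution enabled in the Table Input step, the query is essentially the one from the question with the named parameters dropped in. A minimal sketch, assuming the same mytable and id column from the question (note that BETWEEN is inclusive on both ends, so offset + 1 .. offset + limit returns exactly ${Limit} rows per page):

-- ROW_NUMBER() is 1-based; each iteration reads the next ${Limit} rows
select t1.*
from (
    select *, ROW_NUMBER() over (order by id) as RowNum
    from mytable
) t1
where RowNum between ${Offset} + 1 and ${Offset} + ${Limit};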

The stream from the Table Input then goes on to processing, but is also copied to a Group By step that counts the rows. Leave the group field empty and create an aggregate field that counts all rows. Check the box to always give back a result row.

Send the stream from Group By to a Set Variables step and set the NumRows variable in the scope of the parent job.

Looping back

In the main job, go from the transformation to a Simple Evaluation entry that compares the NumRows variable to the Limit. If NumRows is smaller than ${Limit}, you've reached the last batch: success!

If not, proceed to a JavaScript step that increments the Offset like this:

// read the current Offset and Limit from the parent job (parse as base-10 integers)
var offset = parseInt(parent_job.getVariable("Offset"), 10);
var limit = parseInt(parent_job.getVariable("Limit"), 10);
// advance the offset by one page and write it back as a string
offset = offset + limit;
parent_job.setVariable("Offset", offset.toString());
// the last statement must evaluate to true so the step reports success
true;

The job flow then proceeds through the Dummy entry and back to the transformation, now running with the new offset value.

Notes

  • Unlike a transformation, you can set and use a variable within the same job.
  • The JS step needs "true;" as the last statement so it reports success to the job.



Source: https://stackoverflow.com/questions/58616643/pentaho-data-integration-import-large-dataset-from-db
