Talend: Equivalent of logstash “key value” filter

Submitted by ∥☆過路亽.° on 2020-07-09 14:53:16

Question


I'm discovering Talend Open Source Data Integrator and I would like to transform my data file into a csv file.

My data are some sets of key value data like this example:

A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1

I want to transform it into a CSV like this one:

A,B,C
0,3,4
2,,4
2,4,

With Logstash, I was using the "key value" filter which is able to do this job with a few lines of code. But with Talend, I don't find a similar transformation. I tried a "delimiter file" job and some other jobs without success.
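For reference, the transformation that Logstash's kv filter performs can be sketched in plain Java (the class and method names here are illustrative, not part of any Talend or Logstash API):

```java
import java.util.*;

public class KvToCsv {
    // Parse lines of whitespace-separated "KEY=VALUE" pairs into a CSV
    // with one column per distinct key; missing keys become empty cells.
    public static String transform(List<String> lines) {
        Set<String> keys = new TreeSet<>();              // sorted header: A,B,C,...
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines) {
            Map<String, String> row = new LinkedHashMap<>();
            for (String pair : line.trim().split("\\s+")) {
                String[] kv = pair.split("=", 2);        // "A=0" -> ["A","0"], "A=" -> ["A",""]
                row.put(kv[0], kv.length > 1 ? kv[1] : "");
                keys.add(kv[0]);
            }
            rows.add(row);
        }
        StringBuilder csv = new StringBuilder(String.join(",", keys)).append("\n");
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String k : keys) cells.add(row.getOrDefault(k, ""));  // absent key -> ""
            csv.append(String.join(",", cells)).append("\n");
        }
        return csv.toString();
    }
}
```

On the three-line sample above, `transform` produces exactly the expected CSV, including the empty cells for the missing B and C values.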


Answer 1:


Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:

Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete the column, a dirty hack would be to read the output file using a tFileInputFullRow, and use a regex like ^[^;]+; in a tReplace to replace anything up to (and including) the first ";" in the line with an empty string, and write the result to another file.
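That tReplace step boils down to a single regex replacement, sketched here in plain Java (the class name and sample line are mine, purely for illustration):

```java
public class DropFirstColumn {
    // Remove everything up to and including the first ";" on a line,
    // i.e. strip the extra leading column from each output row.
    public static String clean(String line) {
        return line.replaceFirst("^[^;]+;", "");
    }
}
```

For example, `DropFirstColumn.clean("ID1;0;3;4")` yields `"0;3;4"`.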




Answer 2:


This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it can be quite hard to achieve what you want.

Here is something you can try. It takes quite a few components; I didn't manage to find a solution with fewer. My solution uses some unusual components, such as tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll end up with an extra column.

1 - tFileInputRaw : if you don't know your input schema, just read the whole file with this component.

2 - tConvertType : convert the Object output of tFileInputRaw to a String.

3 - tNormalize : split the content into separate lines (use \n as the separator).

4 - tMap : add a sequence "I" + Numeric.sequence("s1",1,1); this will be used later to identify and regroup the lines.

5 - tNormalize : normalize again, this time on the 'TAB' separator, to get one row per key=value pair.

6 - tMap : split each pair on the "=" sign.

At this step, you'll have an output like :

|seq|key|value|
|---+---+-----|
|I1 |A  |0    |
|I1 |B  |3    |
|I1 |C  |4    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
'---+---+-----'

where seq is the line number.
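Steps 3 through 6 amount to the following plain-Java sketch (the class name is mine, and I split pairs on any whitespace rather than strictly on TAB, to match the sample data):

```java
import java.util.*;

public class NormalizeAndSplit {
    // Turn the raw file content into (seq, key, value) rows,
    // i.e. the intermediate output shown after step 6.
    public static List<String[]> rows(String raw) {
        List<String[]> out = new ArrayList<>();
        int seq = 0;
        for (String line : raw.split("\n")) {                 // step 3: normalize on "\n"
            String id = "I" + (++seq);                        // step 4: sequence id per line
            for (String pair : line.trim().split("\\s+")) {   // step 5: one row per pair
                String[] kv = pair.split("=", 2);             // step 6: split on "="
                out.add(new String[]{id, kv[0], kv.length > 1 ? kv[1] : ""});
            }
        }
        return out;
    }
}
```

Feeding it `"A=0 B=3 C=4\nA=2 C=4"` yields five (seq, key, value) rows, starting with `{"I1","A","0"}`.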

7 - Finally, with tPivotToColumnsDelimited, you'll have the result. Unfortunately, you'll also get the extra "ID" column, as the output schema provided by the component is not editable (the component actually creates the schema itself, which is very unusual among Talend components). Use the ID column as the regroup column.
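The pivot in step 7 can be sketched like this in plain Java (again, names are illustrative; this is not the code Talend generates). It groups the (seq, key, value) rows by seq and emits one delimited line per group, including the extra leading ID column the component produces:

```java
import java.util.*;

public class PivotToColumns {
    // Pivot (seq, key, value) rows into one output line per seq,
    // with one column per distinct key, ";"-delimited.
    public static List<String> pivot(List<String[]> rows) {
        Set<String> keys = new TreeSet<>();                      // sorted distinct keys
        Map<String, Map<String, String>> bySeq = new LinkedHashMap<>();
        for (String[] r : rows) {
            keys.add(r[1]);
            bySeq.computeIfAbsent(r[0], k -> new LinkedHashMap<>()).put(r[1], r[2]);
        }
        List<String> out = new ArrayList<>();
        out.add("ID;" + String.join(";", keys));                 // extra ID column, as noted above
        bySeq.forEach((seq, kv) -> {
            StringBuilder sb = new StringBuilder(seq);
            for (String k : keys) sb.append(";").append(kv.getOrDefault(k, ""));
            out.add(sb.toString());
        });
        return out;
    }
}
```

A group that lacks a key (e.g. no B for I2) simply gets an empty cell in that column.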

Hope this helps. Again, Talend is not an easy tool to use when you have dynamic input/output schemas.



Source: https://stackoverflow.com/questions/57034306/talend-equivalent-of-logstash-key-value-filter
