Apache NiFi: How to compare multiple rows in a csv and create new column

≯℡__Kan透↙ submitted on 2019-12-23 19:17:57

Question


I have a CSV file that looks like this:

Jc,TXF,timer,alpha,beta
15,44,55,12,33
18,87,33,111
9,87,61,29,77

Alpha and beta combined make up a city code. I want to add the name of the city to the CSV as a new column:

Jc,TXF,timer,alpha,beta,city
15,44,55,12,33,York
18,87,33,111,London
9,87,61,29,77,Sydney

I have another CSV with only the columns alpha, beta and city, which looks like this:

alpha,beta,city
12,33,York
33,111,London
29,77,Sydney

How can I achieve this using Apache NiFi? Please suggest the processors and the workflow needed.


Answer 1:


I see two ways of solving this.

The first is to use CsvLookupService. However, CsvLookupService supports only a single key, and you have two, alpha and beta. To use this solution you would have to concatenate both keys into a single key, like 12_33.
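To make the single-key idea concrete, here is a minimal sketch in plain Python (run outside NiFi) that rewrites the mapping file so each row carries one combined key. The function name `build_composite_mapping`, the output column name `key` and the `_` separator are assumptions chosen for illustration; the incoming records would need the same combined key before the lookup could match.

```python
import csv
import io


def build_composite_mapping(mapping_csv_text, separator="_"):
    """Turn alpha,beta,city rows into key,city rows, where key = alpha + separator + beta."""
    reader = csv.DictReader(io.StringIO(mapping_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # 'key' is a hypothetical column name for the combined lookup key
    writer.writerow(["key", "city"])
    for row in reader:
        writer.writerow([row["alpha"] + separator + row["beta"], row["city"]])
    return out.getvalue()


mapping = "alpha,beta,city\n12,33,York\n33,111,London\n29,77,Sydney\n"
composite = build_composite_mapping(mapping)
print(composite)
```

The drawback is visible here: both the mapping file and the source data have to be changed before the lookup works.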

The second is to use the ExecuteScript processor. This one is better, because you don't have to modify your source data. The strategy:

  1. Split the CSV text into lines
  2. Enrich each line with the city column by looking up the alpha and beta keys in the mapping file
  3. Merge the individual lines into a single CSV file.
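The three steps can be sketched in plain Python, independent of NiFi, to show the logic before wiring it into processors. This is only an illustration under assumptions: the function names `load_mapping` and `enrich_csv` are hypothetical, the mapping is indexed by the (alpha, beta) pair instead of being scanned row by row, and the sample uses only the question's rows that have all five fields.

```python
import csv
import io


def load_mapping(mapping_csv_text):
    """Index the mapping rows by the (alpha, beta) pair."""
    reader = csv.DictReader(io.StringIO(mapping_csv_text))
    return {(row["alpha"], row["beta"]): row["city"] for row in reader}


def enrich_csv(data_csv_text, city_by_code):
    """Split the CSV into lines, append the city to each line, merge the lines again."""
    lines = data_csv_text.strip().splitlines()      # step 1: split into lines
    header = lines[0].split(",")
    alpha_idx = header.index("alpha")
    beta_idx = header.index("beta")
    enriched = [lines[0] + ",city"]
    for line in lines[1:]:                          # step 2: enrich each line
        fields = line.split(",")
        key = (fields[alpha_idx], fields[beta_idx])
        if key not in city_by_code:
            raise KeyError("No matching city for %s" % (key,))
        enriched.append(line + "," + city_by_code[key])
    return "\n".join(enriched) + "\n"               # step 3: merge into one CSV


mapping = "alpha,beta,city\n12,33,York\n33,111,London\n29,77,Sydney\n"
data = "Jc,TXF,timer,alpha,beta\n15,44,55,12,33\n9,87,61,29,77\n"
result = enrich_csv(data, load_mapping(mapping))
print(result)
```

In the NiFi flow, steps 1 and 3 are handled by separate processors and only step 2 lives in the script.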

Overall flow:

GenerateFlowFile:

SplitText:

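Set Header Line Count to 1 so that the header line is included in each split. The script below expects every flow file to contain exactly two lines, the header and one data line. For the ExecuteScript processor, set python as the script engine and provide the following script body: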

from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import csv

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        # fetch the mapping CSV file
        with open('/home/nifi/mapping.csv', 'r') as mapping:
            # read the mapping file
            mappingContent = csv.reader(mapping, delimiter=',')
            # flowfile content is CSV text with two lines, header and actual content
            # split by newline to get access to each individual line
            lines = IOUtils.toString(inputStream, StandardCharsets.UTF_8).split('\n')
            # the result will contain the header line 
            # the result will have the additional city column
            result = lines[0] + ',city\n'
            # take the second line and split it
            # to get access to alpha, beta and city values
            lineSplit = lines[1].split(',')

            # Go through the mapping file
            # item[0] -> alpha
            # item[1] -> beta
            # item[2] -> city
            # See if you find alpha and beta on the line content
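            for item in mappingContent:
                if item[0] == lineSplit[3] and item[1] == lineSplit[4]:
                    result += lines[1] + ',' + item[2]
                    break
            else:
                # the loop ended without a break, so no mapping row matched;
                # result always holds at least the header, so it cannot be
                # used to detect a missing match
                raise Exception('No matching found.')

            outputStream.write(bytearray(result.encode('utf-8')))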
# end class

flowFile = session.get()
if flowFile is not None:
    try:
        flowFile = session.write(flowFile, PyStreamCallback())
        session.transfer(flowFile, REL_SUCCESS)
    except Exception as e:
        # any error in the callback routes the flow file to failure
        session.transfer(flowFile, REL_FAILURE)

See the comments in the script for a detailed description. The file /home/nifi/mapping.csv has to be available on your NiFi instance. If you want to learn more about the ExecuteScript processor, refer to the ExecuteScript Cookbook. Finally, merge all the lines into a single CSV file:

Set a CSV reader and a CSV writer and leave their default properties. Adjust the MergeContent properties to control how many lines each resulting CSV file should contain. Result:



Source: https://stackoverflow.com/questions/58620599/apache-nifi-how-to-compare-multiple-rows-in-a-csv-and-create-new-column
