Question
I have a CSV file that looks like this:
Jc,TXF,timer,alpha,beta
15,44,55,12,33
18,87,33,111
9,87,61,29,77
Alpha and beta combined make up a city code. I want to add the name of the city to the CSV as a new column, like this:
Jc,TXF,timer,alpha,beta,city
15,44,55,12,33,York
18,87,33,111,London
9,87,61,29,77,Sydney
I have another CSV with only the columns alpha, beta and city, which looks like this:
alpha,beta,city
12,33,York
33,111,London
29,77,Sydney
How can I achieve this using Apache NiFi? Please suggest the processors and the workflow needed to achieve this.
Answer 1:
I see two ways of solving this.
First, by using CsvLookupService. However, CsvLookupService only supports a single key, and you have two, alpha and beta. So to use this solution you have to concatenate both keys into a single key, like 12_33.
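As a rough illustration of that concatenation, the mapping file could be rewritten with one combined key column. This is plain Python run outside NiFi, not NiFi code; the function name `to_single_key_mapping`, the column name `code` and the `_` separator are assumptions made for this sketch. The incoming data would need the same combined column before the lookup can match on it.

```python
import csv
import io


def to_single_key_mapping(mapping_text, sep="_"):
    """Rewrite an alpha,beta,city mapping as code,city (code = alpha + sep + beta)."""
    reader = csv.DictReader(io.StringIO(mapping_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["code", "city"])  # hypothetical column name for the combined key
    for row in reader:
        writer.writerow([row["alpha"] + sep + row["beta"], row["city"]])
    return out.getvalue()


mapping_text = "alpha,beta,city\n12,33,York\n33,111,London\n29,77,Sydney\n"
single_key = to_single_key_mapping(mapping_text)
print(single_key)
```

The separator matters: without it, the pairs 1 and 233 and 12 and 33 would both collapse to 1233.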
Second, by using the ExecuteScript processor. This one is better, because you don't have to modify your source data. Strategy:
- Split the CSV text into lines
- Enrich each line with the city column by looking up the alpha and beta keys in the mapping file
- Merge the individual lines into a single CSV file.
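The three steps above can be tried out in plain Python before building the flow. This is only a sketch of the logic, not NiFi code: `load_mapping` and `enrich` are hypothetical helper names, and the sketch assumes alpha and beta are always the last two fields of a row, which holds for the sample rows in the question (including the one with only four values).

```python
import csv
import io


def load_mapping(mapping_text):
    """Build a dict keyed on the (alpha, beta) pair."""
    reader = csv.reader(io.StringIO(mapping_text))
    next(reader)  # skip the alpha,beta,city header
    return {(alpha, beta): city for alpha, beta, city in reader}


def enrich(csv_text, mapping):
    """Split into lines, append the city to each data line, merge back."""
    header, *rows = csv_text.strip().split("\n")
    enriched = [header + ",city"]
    for row in rows:
        fields = row.split(",")
        # Assumption: alpha and beta are the last two fields of every row.
        key = (fields[-2], fields[-1])
        if key not in mapping:
            raise KeyError("No city for alpha/beta pair %s" % (key,))
        enriched.append(row + "," + mapping[key])
    return "\n".join(enriched) + "\n"


source = "Jc,TXF,timer,alpha,beta\n15,44,55,12,33\n18,87,33,111\n9,87,61,29,77\n"
mapping = load_mapping("alpha,beta,city\n12,33,York\n33,111,London\n29,77,Sydney\n")
result = enrich(source, mapping)
print(result)
```

Keying the dictionary on the pair itself means the source data does not need a combined key column.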
Overall flow:
GenerateFlowFile:
SplitText:
Set Header Line Count to 1 to include the header line in the split content.
For the ExecuteScript processor, set python as the Script Engine and provide the following Script Body:
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets
    from org.apache.nifi.processor.io import StreamCallback
    import csv

    # Define a subclass of StreamCallback for use in session.write()
    class PyStreamCallback(StreamCallback):
        def __init__(self):
            pass

        def process(self, inputStream, outputStream):
            # fetch the mapping CSV file
            with open('/home/nifi/mapping.csv', 'r') as mapping:
                # read the mapping file
                mappingContent = csv.reader(mapping, delimiter=',')
                # flowfile content is CSV text with two lines, header and actual content
                # split by newline to get access to each individual line
                lines = IOUtils.toString(inputStream, StandardCharsets.UTF_8).split('\n')
                # the result will contain the header line
                # the result will have the additional city column
                result = lines[0] + ',city\n'
                # take the second line and split it
                # to get access to the alpha and beta values
                lineSplit = lines[1].split(',')
                # Go through the mapping file
                # item[0] -> alpha
                # item[1] -> beta
                # item[2] -> city
                # See if you find alpha and beta on the line content
                matched = False
                for item in mappingContent:
                    if item[0] == lineSplit[3] and item[1] == lineSplit[4]:
                        result += lines[1] + ',' + item[2]
                        matched = True
                        break
                # result always holds the header, so track the match explicitly
                if not matched:
                    raise Exception('No matching found.')
                else:
                    outputStream.write(bytearray(result.encode('utf-8')))
    # end class

    flowFile = session.get()
    if flowFile is not None:
        try:
            flowFile = session.write(flowFile, PyStreamCallback())
            session.transfer(flowFile, REL_SUCCESS)
        except Exception as e:
            session.transfer(flowFile, REL_FAILURE)
See the comments for a detailed description of the script. /home/nifi/mapping.csv has to be available on your NiFi instance. If you want to learn more about the ExecuteScript processor, refer to the ExecuteScript Cookbook.
Finally, you merge all the lines into a single CSV file. Set the CSV reader and writer, and leave their default properties. Adjust the MergeContent properties to control how many lines should be in each resulting CSV file.
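For the sample rows, the merged file should match the desired output shown in the question. As a rough picture of what the merge step does with the enriched two-line files, here is a plain-Python sketch. It is not how NiFi implements merging; `merge_fragments` and `max_records` are hypothetical names, with `max_records` standing in for whatever limit is configured on the merge processor.

```python
def merge_fragments(fragments, max_records):
    """Merge header+row fragments into CSV texts of at most max_records rows each."""
    merged = []
    for start in range(0, len(fragments), max_records):
        batch = fragments[start:start + max_records]
        # Every fragment repeats the header, so keep it once per merged file.
        header = batch[0].split("\n")[0]
        rows = [frag.strip().split("\n")[1] for frag in batch]
        merged.append("\n".join([header] + rows) + "\n")
    return merged


fragments = [
    "Jc,TXF,timer,alpha,beta,city\n15,44,55,12,33,York",
    "Jc,TXF,timer,alpha,beta,city\n18,87,33,111,London",
    "Jc,TXF,timer,alpha,beta,city\n9,87,61,29,77,Sydney",
]
files = merge_fragments(fragments, max_records=2)
for text in files:
    print(text)
```

With a limit of 2 the three fragments end up in two files; a limit of 3 or more puts them all in one.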
Source: https://stackoverflow.com/questions/58620599/apache-nifi-how-to-compare-multiple-rows-in-a-csv-and-create-new-column