How to join two CSVs with Apache NiFi

Posted by 时光怂恿深爱的人放手 on 2019-11-29 07:51:09

Apache NiFi is primarily a dataflow tool and is not designed to perform arbitrary joins of streaming data. Those operations are typically better suited to stream processing systems such as Storm, Flink, or Apex, or to ETL tools.

The joins that NiFi handles well are enrichment lookups, where there is a fixed-size lookup dataset and, for each record in the incoming data, you use that dataset to retrieve some value. For example, in your case there could be a processor called LookUpState with a property "State Data" that points to a file containing all the states; customers.csv would then be the input to this processor.
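Outside of NiFi, the enrichment lookup idea can be sketched in a few lines of Python. This is only an illustration of the concept, not NiFi code: the column names (state_code, state_name, customer_id, name), the sample rows, and the helper functions load_lookup and enrich are all made up for the example, since the question does not show the actual layout of states.csv or customers.csv.

```python
import csv
import io

# Hypothetical sample data; the real column names in states.csv and
# customers.csv are not given in the question.
STATES_CSV = "state_code,state_name\nCA,California\nNY,New York\n"
CUSTOMERS_CSV = "customer_id,name,state_code\n1,Alice,CA\n2,Bob,NY\n3,Carol,TX\n"


def load_lookup(text, key_field, value_field):
    """Read the fixed-size reference data into an in-memory dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key_field]: row[value_field] for row in reader}


def enrich(text, lookup, key_field, new_field, default=""):
    """Yield each incoming record with one looked-up value added."""
    for row in csv.DictReader(io.StringIO(text)):
        # A record whose key is missing from the lookup gets the default.
        row[new_field] = lookup.get(row[key_field], default)
        yield row


states = load_lookup(STATES_CSV, "state_code", "state_name")
enriched = list(enrich(CUSTOMERS_CSV, states, "state_code", "state_name"))
for row in enriched:
    print(row)
```

The reference data is read once and held in memory, while the customer records are processed one at a time, which is why this kind of join fits a dataflow tool: only the small side of the join has to be kept around.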

A community member started a project to make a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service

Joe Witt

The typical pattern for this is to load the reference set, in this case the states.csv data, into a map cache controller service in NiFi. The live feed of customer data then comes in and is enriched with this reference data using something like ReplaceText, or you could even write a custom processor in Groovy. There are many ways to slice this. There is also a JIRA/PR coming to make this even easier. Some aspects of live stream joins are best handled in processing systems like Apache Storm, Spark, and Flink, but the case you mention can be done well in NiFi.
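As a rough sketch of that two-step pattern, the Python below first populates a key/value cache and then enriches a feed from it, setting aside records that have no match. The MapCache class is a toy stand-in written for this example and is not the NiFi controller service API; the route function, field names, and sample values are likewise assumptions.

```python
class MapCache:
    """Toy in-memory stand-in for a key/value cache service (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None on a cache miss.
        return self._data.get(key)


def route(records, cache, key_field, new_field):
    """Split records into (matched, unmatched) based on a cache hit."""
    matched, unmatched = [], []
    for record in records:
        value = cache.get(record[key_field])
        if value is None:
            unmatched.append(record)
        else:
            matched.append({**record, new_field: value})
    return matched, unmatched


cache = MapCache()
# Step 1: the reference flow populates the cache (made-up values).
for code, name in [("CA", "California"), ("NY", "New York")]:
    cache.put(code, name)

# Step 2: the live customer feed is enriched from the cache.
feed = [
    {"customer_id": "1", "state_code": "CA"},
    {"customer_id": "2", "state_code": "TX"},
]
matched, unmatched = route(feed, cache, "state_code", "state_name")
print(matched)
print(unmatched)
```

Keeping the unmatched records separate rather than dropping them mirrors how a flow would usually send lookup failures down a different path for inspection.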
