Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

落爺英雄遲暮 submitted on 2020-01-29 09:42:29

Question


I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.

How can I do that?


Answer 1:


You can add as many data sinks to a DataSet program as you need.

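For example, in a program like this:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// readCsvFile(...) returns a CsvReader; types(...) turns it into a typed DataSet
DataSet<Tuple3<String, Long, Long>> data =
    env.readCsvFile(...).types(String.class, Long.class, Long.class);
// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");
// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");

// sinks are lazy; nothing is written until the job is executed
env.execute();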

Here, the DataSet data is read from a CSV file and passed to two independent transformations:

  1. A MapFunction, whose result is written to a text file.
  2. A FilterFunction; the tuples that pass the filter are written to a CSV file.

You can also have multiple data sources, and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.
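The branching in this answer can be mirrored in plain Java to make the two paths concrete. This is only an analogy under stated assumptions: it uses no Flink classes, and the Row type, the map and filter logic, and the file names mapped.txt and filtered.csv are all hypothetical stand-ins for Tuple3, YourMapper, YourFilter, and the two sink paths.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy only: no Flink classes are used here.
class BranchingSketch {

    // Hypothetical record type standing in for Tuple3<String, Long, Long>.
    static final class Row {
        final String name;
        final long a;
        final long b;

        Row(String name, long a, long b) {
            this.name = name;
            this.a = a;
            this.b = b;
        }
    }

    // Branch 1: like a MapFunction, produces exactly one output line per row.
    static List<String> mapBranch(List<Row> rows) {
        return rows.stream()
                .map(r -> r.name + " -> " + (r.a + r.b))
                .collect(Collectors.toList());
    }

    // Branch 2: like a FilterFunction, keeps only rows for which the predicate
    // returns true, then formats the survivors as CSV lines.
    static List<String> filterBranch(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.a > r.b)
                .map(r -> r.name + "," + r.a + "," + r.b)
                .collect(Collectors.toList());
    }

    // Both branches read the same input and write to separate files under dir.
    static void writeBoth(List<Row> rows, Path dir) {
        try {
            Files.write(dir.resolve("mapped.txt"), mapBranch(rows));
            Files.write(dir.resolve("filtered.csv"), filterBranch(rows));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Row> rows = Arrays.asList(new Row("x", 5, 1), new Row("y", 2, 7));
        Path dir = Files.createTempDirectory("branching-sketch");
        writeBoth(rows, dir);
        System.out.println("wrote two files to " + dir);
    }
}
```

Neither branch consumes or alters the input; each one sees every row, which is the same property that lets a DataSet feed several sinks.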




Answer 2:


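You can use the HadoopOutputFormat API in Flink like this:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
  // drop the key from the written record, so only the value ends up in the file
  override def generateActualKey(key: K, value: V): K =
    NullWritable.get().asInstanceOf[K]

  // use the record's key as the output file name
  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.asInstanceOf[String]
}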

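We can then use IteblogMultipleTextOutputFormat as follows:

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapred.HadoopOutputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

val env = ExecutionEnvironment.getExecutionEnvironment
val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
val jc = new JobConf()
FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)
val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
  ("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
batch.output(format)
env.execute()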

For more information, see: http://www.iteblog.com/archives/1667
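To make the intended effect of IteblogMultipleTextOutputFormat concrete, here is a minimal plain-Java sketch of the same routing rule: the key picks the file name, and only the value is written as a line. It is an illustration under assumptions, not how Hadoop or Flink implement it; the class KeyedFileSketch and its methods are hypothetical, it writes to a local directory rather than HDFS, and it ignores parallel writers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: mimics "one output file per key" without Hadoop or Flink.
class KeyedFileSketch {

    // Groups values by key, preserving input order within each key.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> record : records) {
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                   .add(record.getValue());
        }
        return grouped;
    }

    // Writes one file per key under dir: the file name is the key and each
    // line is a value (the key itself is not written into the file).
    static void writeByKey(List<Map.Entry<String, String>> records, Path dir) {
        groupByKey(records).forEach((key, values) -> {
            try {
                Files.write(dir.resolve(key), values);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    static Map.Entry<String, String> entry(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // The same (key, value) pairs as in the Scala snippet of this answer.
    static List<Map.Entry<String, String>> sampleRecords() {
        return Arrays.asList(
                entry("A", "1"), entry("A", "2"), entry("A", "3"),
                entry("B", "1"), entry("B", "2"), entry("C", "1"), entry("D", "2"));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("keyed-file-sketch");
        writeByKey(sampleRecords(), dir);
        System.out.println("wrote one file per key to " + dir);
    }
}
```

With the sample records, this produces four files named A, B, C, and D, and the file A holds the three lines 1, 2, and 3.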



Source: https://stackoverflow.com/questions/37067959/can-flink-write-results-into-multiple-files-like-hadoops-multipleoutputformat
