Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  渐次进展
    2020-11-22 06:11

    saveAsTextFile() and saveAsHadoopFile(...) are implemented on top of the RDD data, specifically through PairRDDFunctions.saveAsHadoopDataset, which takes its data from the PairRDD on which it is executed. I see two possible options.

    If your data is relatively small, you could save some implementation time by grouping the RDD by key, creating a new RDD from each collected group, and using that RDD to write the data. Something like this:

    // Collect the grouped data to the driver: Array[(K, Iterable[V])]
    val byKey = dataRDD.groupByKey().collect()
    // Turn each collected group back into its own RDD
    val rddByKey = byKey.map { case (k, v) => k -> sc.makeRDD(v.toSeq) }
    // Write each per-key RDD to its own output path
    rddByKey.foreach { case (k, rdd) => rdd.saveAsTextFile(prefix + k) }
    

    Note that this will not work for large datasets, because the materialization of the iterator at v.toSeq might not fit in memory.

    The other option I see, and actually the one I'd recommend in this case, is to roll your own by directly calling the Hadoop/HDFS API.
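    As a rough sketch of that direction: one common way to get one output location per key in a single job is to plug a custom Hadoop output format (MultipleTextOutputFormat from the old mapred API) into saveAsHadoopFile. The class name and output path below are illustrative, and this assumes dataRDD is a pair RDD with String keys and values:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Routes each record to a subdirectory named after its key.
    // (Hypothetical class name; adjust the key/value types to your data,
    // and define it at top level so Hadoop can instantiate it by reflection.)
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Use the key as part of the output file name
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
      // Drop the key from the written record so only the value ends up in the file
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    dataRDD.saveAsHadoopFile(
      "/tmp/output-by-key",               // illustrative output path
      classOf[String], classOf[String],   // key and value classes of dataRDD
      classOf[RDDMultipleTextOutputFormat])

    With something like this, a single job writes one subdirectory per key under the output path, without collecting anything to the driver.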

    Here's a discussion I started while researching this question: How to create RDDs from another RDD?
