Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  渐次进展
    2020-11-22 06:11

    saveAsTextFile() and saveAsHadoopFile(...) are implemented on top of the RDD data, specifically through PairRDDFunctions.saveAsHadoopDataset, which takes its data from the PairRDD on which it is executed. I see two possible options.

    If your data is relatively small, you could save some implementation time by grouping the RDD by key, creating a new RDD from each collected group, and using that RDD to write the data. Something like this:

    // Collect the grouped data to the driver: Array[(K, Iterable[V])]
    val byKey = dataRDD.groupByKey().collect()
    // Turn each collected group back into its own RDD
    val rddByKey = byKey.map { case (k, v) => k -> sc.makeRDD(v.toSeq) }
    // Write each per-key RDD to its own output path
    rddByKey.foreach { case (k, rdd) => rdd.saveAsTextFile(prefix + k) }
    

    Note that this will not work for large datasets, because the materialization of the iterator at v.toSeq might not fit in memory.

    The other option I see, and actually the one I'd recommend in this case, is to roll your own by directly calling the Hadoop/HDFS API.
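    As a rough sketch of that direction: one common way to get one output location per key in a single job is to plug a custom Hadoop output format (MultipleTextOutputFormat from the old mapred API) into saveAsHadoopFile. The class name and output path below are illustrative, and this assumes dataRDD is a pair RDD with String keys and values:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Routes each record to a subdirectory named after its key.
    // (Hypothetical class name; adjust the key/value types to your data,
    // and define it at top level so Hadoop can instantiate it by reflection.)
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Use the key as part of the output file name
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
      // Drop the key from the written record so only the value ends up in the file
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    dataRDD.saveAsHadoopFile(
      "/tmp/output-by-key",               // illustrative output path
      classOf[String], classOf[String],   // key and value classes of dataRDD
      classOf[RDDMultipleTextOutputFormat])

    With something like this, a single job writes one subdirectory per key under the output path, without collecting anything to the driver.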

    Here's a discussion I started while researching this question: How to create RDDs from another RDD?
