Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, dependent on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  Happy的楠姐
    2020-11-22 06:05

    I would do it like this, which is scalable:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark._
    import org.apache.spark.SparkContext._

    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Write only the value; the key is not repeated inside the file.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()

      // Use the key itself as the name of the output file.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    object Split {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Split" + args(1))
        val sc = new SparkContext(conf)
        sc.textFile("input/path")
          .map(a => (k, v)) // your own (key, value) extraction
          .partitionBy(new HashPartitioner(num)) // num = number of partitions, see below
          .saveAsHadoopFile("output/path", classOf[String], classOf[String],
            classOf[RDDMultipleTextOutputFormat])
        sc.stop()
      }
    }

    I just saw a similar answer above, but we actually don't need custom partitioners. MultipleTextOutputFormat will create one file per key. It is fine for multiple records with the same key to fall into the same partition.
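    If you would rather get one directory per key (keeping the usual part-NNNNN file names inside each directory), a variant of generateFileNameForKeyValue along the lines of the sketch below should work. This is only a sketch: it assumes Hadoop's MultipleOutputFormat treats the returned string as a path relative to the output directory, so the "/" creates a per-key subdirectory, and the class name here is just illustrative.

    class RDDPerKeyDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Still write only the value into the files.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()

      // "<key>/<part-name>": one subdirectory per key, normal part files inside.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
    }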

    In new HashPartitioner(num), num is the number of partitions you want. If you have a large number of distinct keys, you can set this number high; that way each partition holds fewer keys and does not have to keep too many HDFS file handles open.
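    One way to pick num is to size it from the data itself. A minimal sketch, assuming the pair RDD is called pairs, the keys are plain strings, and the distinct-key count fits in an Int; the tab-split key extraction is only a placeholder:

    // Sketch only: derive (key, value) pairs (placeholder split on tab),
    // then use the number of distinct keys as the partition count so that
    // each partition holds few keys and therefore few open HDFS writers.
    val pairs = sc.textFile("input/path")
      .map(line => (line.split("\t")(0), line))
    val num = pairs.keys.distinct().count().toInt
    pairs
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])

    Note that the distinct().count() step triggers an extra pass over the data before the actual write.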
