Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, dependent on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  日久生厌
    2020-11-22 06:05

    I needed the same thing in Java. Posting my translation of Zhang Zhan's Scala answer for Spark Java API users:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    
    import java.util.Arrays;
    
    
    // Name each output file after the record's key.
    class RDDMultipleTextOutputFormat<A, B> extends MultipleTextOutputFormat<A, B> {
    
        @Override
        protected String generateFileNameForKeyValue(A key, B value, String name) {
            return key.toString();
        }
    }
    
    public class Main {
    
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("Split Job")
                    .setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);
            String[] strings = {"Abcd", "Azlksd", "whhd", "wasc", "aDxa"};
            sc.parallelize(Arrays.asList(strings))
                    // The lowercased first character of the string is the key
                    .mapToPair(s -> new Tuple2<>(s.substring(0,1).toLowerCase(), s))
                    .saveAsHadoopFile("output/", String.class, String.class,
                            RDDMultipleTextOutputFormat.class);
            sc.stop();
        }
    }
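
    By default MultipleTextOutputFormat delegates the actual writing to TextOutputFormat, so each line in a per-key file is written as the key, a tab, and the value. Zhang Zhan's Scala answer also overrides generateActualKey to return NullWritable so that only the value ends up in the file; a rough Java equivalent of that variant (the class name KeyOnlyFileOutputFormat is just illustrative) could look like this:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    
    // Sketch: same per-key file naming as above, but the key is dropped
    // from the file contents, leaving one value per line.
    class KeyOnlyFileOutputFormat<A, B> extends MultipleTextOutputFormat<A, B> {
    
        @Override
        @SuppressWarnings("unchecked")
        protected A generateActualKey(A key, B value) {
            // Replacing the key with NullWritable makes TextOutputFormat omit
            // both the key and the tab separator on every output line.
            return (A) NullWritable.get();
        }
    
        @Override
        protected String generateFileNameForKeyValue(A key, B value, String name) {
            return key.toString();
        }
    }

    With the sample data this yields files named after each key under output/ (a and w). If the same key can land in several partitions, it is common to repartition by key first, or to keep the part name in the generated file name (for example key + "/" + name), so that concurrent tasks do not write to the same file.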
    
