Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, dependent on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  忘掉有多难
    2020-11-22 05:53

    I had a similar use case. I resolved it in Java by writing two custom classes implementing MultipleTextOutputFormat and RecordWriter.

    My input was a JavaPairRDD<String, List<String>> and I wanted to store it in a file named by its key, with all the lines contained in its value.

    Here is the code for my MultipleTextOutputFormat implementation

    class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {
    
        @Override
        protected String generateFileNameForKeyValue(K key, V value, String name) {
            return key.toString(); //The return will be used as file name
        }
    
        /** The following 4 methods are overridden only to widen their visibility
        (they are called from the class MyRecordWriter) **/
        protected String generateLeafFileName(String name) {
            return super.generateLeafFileName(name);
        }
    
        protected V generateActualValue(K key, V value) {
            return super.generateActualValue(key, value);
        }
    
        protected String getInputFileBasedOutputFileName(JobConf job, String name) {
            return super.getInputFileBasedOutputFileName(job, name);
        }
    
        protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
            return super.getBaseRecordWriter(fs, job, name, arg3);
        }
    
        /** Use my custom RecordWriter **/
        @Override
        public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job, String name, final Progressable arg3) throws IOException {
            final String myName = this.generateLeafFileName(name);
            return new MyRecordWriter(this, fs, job, arg3, myName);
        }
    } 
    

    Here is the code for my RecordWriter implementation.

    class MyRecordWriter<K, V> implements RecordWriter<K, V> {
    
        private RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat;
        private final FileSystem fs;
        private final JobConf job;
        private final Progressable arg3;
        private String myName;
    
        TreeMap<String, RecordWriter> recordWriters = new TreeMap<>();
    
        MyRecordWriter(RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat, FileSystem fs, JobConf job, Progressable arg3, String myName) {
            this.rddMultipleTextOutputFormat = rddMultipleTextOutputFormat;
            this.fs = fs;
            this.job = job;
            this.arg3 = arg3;
            this.myName = myName;
        }
    
        @Override
        public void write(K key, V value) throws IOException {
            String keyBasedPath = rddMultipleTextOutputFormat.generateFileNameForKeyValue(key, value, myName);
            String finalPath = rddMultipleTextOutputFormat.getInputFileBasedOutputFileName(job, keyBasedPath);
            Object actualValue = rddMultipleTextOutputFormat.generateActualValue(key, value);
            RecordWriter rw = this.recordWriters.get(finalPath);
            if(rw == null) {
                rw = rddMultipleTextOutputFormat.getBaseRecordWriter(fs, job, finalPath, arg3);
                this.recordWriters.put(finalPath, rw);
            }
            List<String> lines = (List<String>) actualValue;
            for (String line : lines) {
                rw.write(null, line);
            }
        }
    
        @Override
        public void close(Reporter reporter) throws IOException {
            Iterator<String> keys = this.recordWriters.keySet().iterator();
    
            while(keys.hasNext()) {
                RecordWriter rw = this.recordWriters.get(keys.next());
                rw.close(reporter);
            }
    
            this.recordWriters.clear();
        }
    }
    

    Most of the code is exactly the same as in FileOutputFormat. The only difference is these few lines:

    List<String> lines = (List<String>) actualValue;
    for (String line : lines) {
        rw.write(null, line);
    }
    

    These lines let me write each line of my input List to the file. The first argument of the write function is set to null to avoid writing the key on each line.
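    The pattern MyRecordWriter relies on — open one writer per output path on first use, reuse it on later writes, close everything at the end — can be sketched in plain Java without any Hadoop dependencies. This is a minimal illustration with a hypothetical WriterCache class, not the answer's actual code:

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the lazy per-path writer cache: one BufferedWriter per output
// file, created on first write to that path and closed all together.
class WriterCache implements Closeable {
    private final Path dir;
    private final Map<String, BufferedWriter> writers = new TreeMap<>();

    WriterCache(Path dir) {
        this.dir = dir;
    }

    void write(String path, String line) throws IOException {
        BufferedWriter w = writers.get(path);
        if (w == null) { // create lazily, like getBaseRecordWriter in the answer
            w = Files.newBufferedWriter(dir.resolve(path));
            writers.put(path, w);
        }
        w.write(line);
        w.newLine();
    }

    @Override
    public void close() throws IOException { // mirrors MyRecordWriter.close
        for (BufferedWriter w : writers.values()) {
            w.close();
        }
        writers.clear();
    }
}
```

    The TreeMap keyed by output path plays the same role as the recordWriters field above: repeated writes to the same key reuse the already-open writer instead of reopening the file.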

    To finish, I only need to make this call to write my files:

    javaPairRDD.saveAsHadoopFile(path, String.class, List.class, RDDMultipleTextOutputFormat.class);
    
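    The overall effect of the job — each key of a (key → list of lines) pair becomes one output file, each list element one line of it — can be sketched without Spark or Hadoop at all. This is a minimal plain-Java illustration with hypothetical names and sample data, not the distributed implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the JavaPairRDD<String, List<String>> output step:
// demultiplex a map of (key -> lines) into one file per key.
class KeyedDemux {
    static void demux(Map<String, List<String>> pairs, Path outDir) throws IOException {
        for (Map.Entry<String, List<String>> e : pairs.entrySet()) {
            // The key alone names the file, as generateFileNameForKeyValue does.
            Files.write(outDir.resolve(e.getKey()), e.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> pairs = new TreeMap<>();
        pairs.put("keyA", Arrays.asList("line1", "line2"));
        pairs.put("keyB", Arrays.asList("line3"));
        Path dir = Files.createTempDirectory("demux");
        demux(pairs, dir);
        System.out.println(Files.readAllLines(dir.resolve("keyA"))); // [line1, line2]
    }
}
```

    In the Spark version, saveAsHadoopFile performs the iteration in parallel across partitions and the custom output format supplies the per-key file naming.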
