Multiple Output Files for Hadoop Streaming with Python Mapper

Asked 2020-12-09 21:58

I am looking for a little clarification on the answers to this question:

Generating Separate Output files in Hadoop Streaming

My use case is as follows: I have a mapper-only Python streaming job that tags each record with a target file name (valid_file_name or err_file_name), and I want each tag's records written to a separate output location, with the tag itself stripped from the final output, all in a single job.

1 Answer
  • 2020-12-09 22:21

    You can do something like the following, but it involves compiling a little Java, which shouldn't be a problem if you want the rest of your use case handled in Python. As far as I know, from Python alone it is not directly possible to drop the file-name tag from the final output in a single job, as your use case demands, but the small Java class below makes it easy.
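
    For reference, here is a minimal sketch of what the Python mapper side might look like. The tag names valid_file_name and err_file_name come from the use case above; the is_valid check is a placeholder assumption for illustration:

      #!/usr/bin/env python
      # Hypothetical mapper sketch: emit <tag>\t<record> so the key (the tag)
      # becomes the output directory name and the value is the record itself.
      import sys

      def is_valid(record):
          # Placeholder validation rule -- replace with your real check.
          return len(record.split(',')) == 3

      for line in sys.stdin:
          record = line.strip()
          if not record:
              continue
          tag = 'valid_file_name' if is_valid(record) else 'err_file_name'
          # CustomMultiOutputFormat (below) uses the key as the directory and
          # then drops it, so only the record text lands in the final files.
          print('%s\t%s' % (tag, record))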

    Here is the Java class that needs to be compiled:

    package com.custom;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        /**
         * Use the key as part of the path for the final output file.
         */
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
            return new Path(key.toString(), leaf).toString();
        }

        /**
         * Discard the key, as your requirement demands.
         */
        @Override
        protected Text generateActualKey(Text key, Text value) {
            return null;
        }
    }
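
    To make the two overrides concrete, here is a small pure-Python simulation of the same logic (illustration only, not Hadoop code): generateFileNameForKeyValue prefixes the default leaf name with the key, and generateActualKey returning null is what strips the tag from the written line.

      # Illustrative only: mimics what CustomMultiOutputFormat does per record.
      def generate_file_name(key, leaf):
          # Java equivalent: new Path(key.toString(), leaf).toString()
          return '%s/%s' % (key, leaf)

      def generate_actual_key(key, value):
          # Java equivalent: return null -- the tag is discarded from the output.
          return None

      print(generate_file_name('valid_file_name', 'part-00000'))
      # -> valid_file_name/part-00000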
    

    Steps to compile:

    1. Save the class above to a file named exactly CustomMultiOutputFormat.java.
    2. From the directory containing that file, run:

      $JAVA_HOME/bin/javac -cp $(hadoop classpath) -d . CustomMultiOutputFormat.java

    3. Make sure JAVA_HOME is set to /path/to/your/SUNJDK before attempting the above command.

    4. Build your custom.jar file by typing exactly:

      $JAVA_HOME/bin/jar cvf custom.jar com/custom/CustomMultiOutputFormat.class

    5. Finally, run your job like this:

  hadoop jar /path/to/your/hadoop-streaming-*.jar -libjars custom.jar -outputformat com.custom.CustomMultiOutputFormat -file your_script.py -input inputpath -numReduceTasks 0 -output outputpath -mapper your_script.py

    After doing this you should see two directories inside your outputpath: one named valid_file_name and the other named err_file_name. All records tagged valid_file_name go to the valid_file_name directory, and all records tagged err_file_name go to the err_file_name directory.
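
    For example, with a single map task the output layout would look roughly like this (the part-file names are an assumption; the exact set depends on how many map tasks run):

      outputpath/
          valid_file_name/part-00000
          err_file_name/part-00000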

    I hope all of this makes sense.
