How to run external program within mapper or reducer giving HDFS files as input and storing output files in HDFS?

问题

I have a external program which take file as a input and give output file

     //for example 
     input file: IN_FILE
     output file: OUT_FILE

    //Run External program 
     ./vx < ${IN_FILE} > ${OUT_FILE}

I want both input and output files in HDFS

I have cluster with 8 nodes.And I have 8 input files each have 1 line

    //1 input file :       1.txt 
           1:0,0,0
    //2 input file :       2.txt 
           2:0,0,128
    //3 input file :       3.txt 
           3:0,128,0
    //5 input file :       4.txt 
           4:0,128,128
    //5 input file :       5.txt 
           5:128,0,0
    //6 input file :       6.txt 
           6:128,0,128
    //7 input file :       7.txt 
           7:128,128,0
    //8 input file :       8.txt 
           8:128,128,128

I am using KeyValueTextInputFormat

               key :file name
               value: initial coordinates

For example 5th file

              key :5
              value:128,0,0

each map tasks generate huge amount of data according to their initial coordinates.

Now I want to run external program in each map task and generate output file.

But I am confuse how to do that with files in HDFS .

         I can use zero reducer and create file in HDFS 

         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         Path outFile;
         outFile = new Path(INPUT_FILE_NAME);
         FSDataOutputStream out = fs.create(outFile);

         //generating data ........ and writing to HDFS 
          out.writeUTF(lon + ";" + lat + ";" + depth + ";");

I am confuse how to run external program with HDFS file without getting file into file into local directory .

  with  dfs -get

Without using MR I am getting results with shell script as following

#!/bin/bash

if [ $# -lt 2 ]; then
    printf "Usage: %s: <infile> <outfile> \n" $(basename $0) >&2
          exit 1
fi

IN_FILE=/Users/x34/data/$1
OUT_FILE=/Users/x34/data/$2                     

cd "/Users/x34/Projects/externalprogram/model/"

./vx < ${IN_FILE} > ${OUT_FILE}

paste ${IN_FILE} ${OUT_FILE} | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$22,"\t",$23,"\t",$24}' > /Users/x34/data/combined
if [ $? -ne 0 ]; then
    exit 1
fi                      

exit 0

And then I run it with

         ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT","in", "out"); 
         Process p = pb.start();

I would much appreciate any idea how to use hadoop streaming or any other way to run external program .I want both INPUT and OUTPUT files in HDFS for further processing .

Please help

回答1:

So assuming that your external program doesnt know how to recognize or read from hdfs, then what you will want to do is load in the file from java and pass it as input directly to the program

Path path = new Path("hdfs/path/to/input/file");
FileSystem fs = FileSystem.get(configuration);
FSDataInputStream fin = fs.open(path);
ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT");
Process p = pb.start();
OutputStream os = p.getOutputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(fin));
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os));

String line = null;
while ((line = br.readLine())!=null){
    writer.write(line);
}

The output can be done in the reverse manner. Get the InputStream from the process, and make a FSDataOutputStream to write to the hdfs.

Essentially your program with these two things becomes an adapter that converts HDFS to input and output back into HDFS.

回答2:

You could emply Hadoop Streaming for that:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py \
-file myDictionary.txt

See https://hadoop.apache.org/docs/r1.0.4/streaming.pdf for some examples.

Also a nice article : http://princetonits.com/blog/technology/hadoop-mapreduce-streaming-using-bash-script/

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Another example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feed the lines to the stdin of the process. In the meantime, the mapper collects the line oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value. If there is no tab character in the line, then entire line is considered as key and the value is null. However, this can be customized, as discussed later.

When an executable is specified for reducers, each reducer task will launch the executable as a separate process then the reducer is initialized. As the reducer task runs, it converts its input key/values pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.

来源：https://stackoverflow.com/questions/16345438/how-to-run-external-program-within-mapper-or-reducer-giving-hdfs-files-as-input

标签

Hadoop

MapReduce