Grep across multiple files in Hadoop Filesystem

时光说笑 2020-12-30 02:01

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

5 answers
  •  长发绾君心
    2020-12-30 02:31

    Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can be used for ad hoc queries without resorting to a full-fledged MapReduce job. For example, in your case create a script get_filename_for_pattern.sh:

    #!/bin/bash
    # Print the name of the current input file if the pattern ($1) occurs on stdin.
    grep -q -- "$1" && echo "$mapreduce_map_input_file"
    cat >/dev/null # drain the rest of the input
    

    Note that the script has to read the whole input (grep -q stops at the first match, hence the trailing cat); otherwise you get java.io.IOException: Stream closed exceptions.
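    Before submitting a job, the mapper can be tried locally. The following is only a sketch: the sample files, their contents, and the pattern "needle" are made up for illustration, and the script is a quoted variant of the one above. Hadoop Streaming passes job configuration values to the mapper as environment variables with dots replaced by underscores, which is where mapreduce_map_input_file comes from; here it is set by hand.

```shell
#!/bin/bash
# Local dry run of the mapper, no cluster needed.
# All file names and contents below are hypothetical test data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Recreate the mapper script (quoted variant of the script from the answer).
cat > "$workdir/get_filename_for_pattern.sh" <<'EOF'
#!/bin/bash
grep -q -- "$1" && echo "$mapreduce_map_input_file"
cat >/dev/null # drain the rest of the input
EOF
chmod +x "$workdir/get_filename_for_pattern.sh"

# Two sample inputs: only the first one contains the pattern.
printf 'foo\nneedle here\nbar\n' > "$workdir/a.txt"
printf 'foo\nbar\n' > "$workdir/b.txt"

# Run the mapper once per file, setting by hand the variable that
# Hadoop Streaming would normally export for each map task.
matches=$(
    for f in "$workdir/a.txt" "$workdir/b.txt"; do
        mapreduce_map_input_file="$f" \
            "$workdir/get_filename_for_pattern.sh" needle < "$f"
    done
)
echo "$matches"
```

    Only the path of a.txt should be printed, since b.txt does not contain the pattern.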

    Then issue the commands

    hadoop jar "$HADOOP_HOME/hadoop-streaming.jar" \
     -Dstream.non.zero.exit.is.failure=false \
     -files get_filename_for_pattern.sh \
     -numReduceTasks 1 \
     -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
     -reducer "uniq" \
     -input '/apps/hdmi-technology/b_dps/real-time/*' \
     -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc
    hadoop fs -cat '/tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*'
    

    In newer distributions, mapred streaming should work in place of hadoop jar $HADOOP_HOME/hadoop-streaming.jar. If you use the hadoop jar form, you have to set $HADOOP_HOME correctly so that the jar can be found (or provide the full path to the jar directly).

    For simpler queries you don't even need a script; you can provide the command to the -mapper parameter directly. For anything even slightly complex, though, a script is preferable, because getting the escaping right can be a chore.

    If you don't need a reduce phase, pass the symbolic value NONE to the -reducer option (or just use -numReduceTasks 0). In your case, though, a reduce phase is useful because it consolidates the output into a single file.
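    To see what the uniq reducer contributes, the reduce step can be imitated locally. This is a rough sketch with invented paths, not output from a real job: it assumes one large file was processed as two input splits, so two map tasks reported the same name. Streaming sorts the map output by key before the reducer reads it, and a plain sort stands in for that here.

```shell
#!/bin/bash
# Local imitation of the reduce step. The paths are hypothetical.
# Pretend three map tasks reported a match; big.log was processed as
# two input splits, so its name was emitted twice.
map_output=$(printf '%s\n' \
    '/data/big.log' \
    '/data/small.log' \
    '/data/big.log')

# `sort` stands in for the framework's sort phase, `uniq` is the reducer.
reduced=$(printf '%s\n' "$map_output" | sort | uniq)
echo "$reduced"
```

    Each matching file then appears once, and with a single reducer the result ends up in one output file.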
