Grep across multiple files in Hadoop Filesystem

时光说笑 2020-12-30 02:01

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

5 answers
  •  渐次进展
    2020-12-30 02:22

    This is a Hadoop "filesystem", not a POSIX one, so a plain grep cannot open the files directly. Stream each file through hadoop fs -cat instead:

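    # NF >= 8 skips the "Found N items" header line, which has no path field.
    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | \
    while read -r f
    do
      # Stream the file out of HDFS; print its path only if grep finds the string.
      hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
    done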
    

    This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

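    # xargs substitutes each path for ^ and runs up to 10 searches at once.
    # -I already runs one command per input line, so no -n 1 is needed.
    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | \
      xargs -I ^ -P 10 bash -c \
      "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"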
    

    Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is the bottleneck in your setup.
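
    If you want to experiment with different levels of parallelism, the pipeline above could be wrapped in a small function. This is only a sketch: the name hdfs_grep_files and the HDFS_CAT variable are made up for illustration (HDFS_CAT is not a Hadoop setting; it just lets you try the function on local files with HDFS_CAT=cat). The pattern and path are passed to bash -c as positional arguments rather than pasted into the command string, so odd characters in a path are not re-parsed as shell code.

```shell
# Hypothetical helper: reads one path per line on stdin and prints every
# path whose contents match PATTERN. Usage: hdfs_grep_files PATTERN [JOBS]
hdfs_grep_files() {
  local pattern=$1
  local jobs=${2:-4}   # number of parallel searches; defaults to 4
  # HDFS_CAT is an assumed override hook (export it to use it);
  # when unset, the real "hadoop fs -cat" is used.
  xargs -P "$jobs" -I{} bash -c '
    if ${HDFS_CAT:-hadoop fs -cat} "$2" | grep -q -e "$1"; then
      echo "$2"
    fi
  ' _ "$pattern" {}
}
```

    It would be called along these lines, with the last argument playing the role of -P:

    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk 'NF >= 8 {print $8}' | hdfs_grep_files bcd4bc3e1380a56108f486a4fffbc8dc 10

    Because the searches run in parallel, matching paths come out in whatever order they finish, not in listing order.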

    EDIT: Given that you're on SunOS (which is slightly brain-dead), try this version, which drops grep -q and discards grep's output instead:

    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done
    
