Hadoop: compress file in HDFS?

逝去的感伤 2020-11-27 18:23

I recently set up LZO compression in Hadoop. What is the easiest way to compress a file in HDFS? I want to compress a file and then delete the original. Should I create a

7 Answers
  •  粉色の甜心
    2020-11-27 18:43

    I know this is an old thread, but for anyone following it (like me), it is useful to know that either of the following two methods leaves a tab (\t) character at the end of each line:

     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
          -Dmapred.output.compress=true \
          -Dmapred.compress.map.output=true \
          -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
          -Dmapred.reduce.tasks=0 \
          -input  \
          -output $OUTPUT \
          -mapper "cut -f 2"
    
    
    hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
            -Dmapred.reduce.tasks=1 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
            -input /input/raw_file \
            -output /archives/ \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
    

    In both cases hadoop-streaming.jar appends x'09' (a tab) to the end of each line. The fix I found is to set the following two parameters to the delimiter your data uses (in my case it was a comma):

     -Dstream.map.output.field.separator=, \
     -Dmapred.textoutputformat.separator=, \
    

    The full command to execute:

    hadoop jar /jars/hadoop-streaming-2.6.0-cdh5.4.11.jar \
            -Dmapred.reduce.tasks=1 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dstream.map.output.field.separator=, \
            -Dmapred.textoutputformat.separator=, \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec \
            -input file:////home/admin.kopparapu/accenture/File1_PII_Phone_part3.csv \
            -output file:///home/admin.kopparapu/accenture/part3 \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
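
    To check whether a job's output still has the trailing tab, the decompressed text can be piped through a small filter. The sketch below is not part of Hadoop: the function names count_trailing_tabs and strip_trailing_tab are hypothetical helpers, and the output path in the usage comments is an assumed example. It relies only on grep and sed, and assumes the output is read with hadoop fs -text, which decompresses files written with a supported codec.

```shell
# Hypothetical helpers for inspecting streaming output (illustrative names).

# Count the lines on stdin that end with a tab character.
count_trailing_tabs() {
    local tab
    tab=$(printf '\t')
    # grep -c exits non-zero when the count is 0; keep the status clean.
    grep -c "${tab}\$" || true
}

# Remove one trailing tab from each line on stdin, if present.
strip_trailing_tab() {
    local tab
    tab=$(printf '\t')
    sed "s/${tab}\$//"
}

# Usage (assumed path; adjust to the actual part file of your job):
#   hadoop fs -text /archives/part-00000.bz2 | count_trailing_tabs
#   hadoop fs -text /archives/part-00000.bz2 | strip_trailing_tab > cleaned.csv
```

    A count of 0 from the first helper suggests the separator settings took effect. The second helper is only a workaround for output that was already written with the default separator.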
    
