I recently set up LZO compression in Hadoop. What is the easiest way to compress a file in HDFS? I want to compress a file and then delete the original. Should I create an MR job with an identity mapper and an identity reducer that uses LZO compression?
@Chitra I cannot comment due to reputation issues, so I am answering here.
Here is everything in one command: instead of running a second command, you can reduce directly into a single compressed file:
hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input /input/raw_file \
-output /archives/ \
-mapper /bin/cat \
-reducer /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
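A note on the configuration keys: the mapred.* names above are deprecated aliases in Hadoop 2.x. The command still works, but the same job with the current mapreduce.* property names would look roughly like this (a sketch assuming Hadoop 2.x or later; if you have hadoop-lzo set up, you could swap the codec class for com.hadoop.compression.lzo.LzopCodec):

# Same job using the non-deprecated mapreduce.* property names
hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-Dmapreduce.job.reduces=1 \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.map.output.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input /input/raw_file \
-output /archives/ \
-mapper /bin/cat \
-reducer /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat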
This way you save a lot of space by ending up with only one compressed file.
For example, let's say I have 4 files of 10 MB each (plain text, JSON formatted).
A map-only job gives me 4 files of 650 KB each; if I map and reduce, I get a single file of 1.05 MB.
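Since the question also asks about deleting the original after compressing: once the job finishes, you can check the output and remove the source with plain HDFS shell commands (a sketch using the paths from the example above; part-00000.bz2 is the typical default name for a single compressed reducer output, but yours may differ):

# Inspect the size of the compressed output
hdfs dfs -du -h /archives/
# Sanity-check the contents (hdfs dfs -text decompresses known codecs)
hdfs dfs -text /archives/part-00000.bz2 | head
# Remove the original once you are satisfied
hdfs dfs -rm /input/raw_file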