How to enable LZO compression in Hadoop MapReduce?

LZO's licence (GPL) is incompatible with that of Hadoop (Apache) and therefore it cannot be bundled with it. One needs to install LZO separately on the cluster.

The following steps were tested on Cloudera's Demo VM (CentOS 6.2, x64), which comes with the full CDH 4.2.0 stack and CM Free Edition installed, but they should work on any Red Hat-based Linux distribution.

The installation consists of the following steps:

  • Installing LZO

    sudo yum install lzop

    sudo yum install lzo-devel

  • Installing ANT

    sudo yum install ant ant-nodeps ant-junit java-devel

  • Downloading the source

    git clone https://github.com/twitter/hadoop-lzo.git

  • Compiling Hadoop-LZO

    cd hadoop-lzo

    ant compile-native tar

    For further instructions and troubleshooting see https://github.com/twitter/hadoop-lzo

  • Copying the Hadoop-LZO jar to Hadoop libs

    sudo cp build/hadoop-lzo*.jar /usr/lib/hadoop/lib/

  • Moving native code to Hadoop native libs

    sudo mv build/hadoop-lzo-0.4.17-SNAPSHOT/lib/native/Linux-amd64-64/ /usr/lib/hadoop/lib/native/

    sudo cp /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression.* /usr/lib/hadoop/lib/native/

    Adjust the version number (0.4.17-SNAPSHOT here) to match the version you cloned.

  • When working with a real cluster (as opposed to a pseudo-cluster) you need to rsync these to the rest of the machines

    rsync -av /usr/lib/hadoop/lib/ somehost:/usr/lib/hadoop/lib/

    Run this for every other host (somehost is a placeholder for your node names). You can do a dry run first by adding the -n flag.

  • Login to Cloudera Manager

  • Select from Services: mapreduce1->Configuration

  • Client->Compression

  • Add to Compression Codecs:

    com.hadoop.compression.lzo.LzoCodec

    com.hadoop.compression.lzo.LzopCodec
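
    If you are not using Cloudera Manager, the equivalent raw setting is the io.compression.codecs property in core-site.xml. A sketch, assuming you keep the stock codecs and append the LZO ones:

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec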

  • Search "valve"

  • Add to MapReduce Service Configuration Safety Valve

    io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec

    mapred.child.env="JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64/"

  • Add to MapReduce Service Environment Safety Valve

    HADOOP_CLASSPATH=/usr/lib/hadoop/lib/*

That's it.
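
To sanity-check the setup, you can compress a small file with lzop, push it to HDFS and let Hadoop decode it through the newly registered codec (file names and paths are arbitrary examples):

    # create and compress a test file; lzop keeps the original and writes sample.txt.lzo
    echo "hello lzo" > /tmp/sample.txt
    lzop /tmp/sample.txt

    # upload it and read it back; -text resolves the codec from the .lzo extension
    hadoop fs -put /tmp/sample.txt.lzo /tmp/
    hadoop fs -text /tmp/sample.txt.lzo

If the last command prints "hello lzo", the codec and the native library are wired up correctly.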

Your MapReduce jobs that use TextInputFormat will work seamlessly with .lzo files. However, if you choose to index the LZO files to make them splittable (using com.hadoop.compression.lzo.DistributedLzoIndexer), you will find that the indexer writes a .index file next to each .lzo file. This is a problem because TextInputFormat will interpret these as part of the input. In that case you need to change your MR jobs to work with LzoTextInputFormat.
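
For reference, the indexer is invoked roughly like this (a sketch: the jar version should match what you built above, and /user/example/logs is a placeholder for your input path):

    # launches an MR job that indexes every .lzo file under the given path
    hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.17-SNAPSHOT.jar \
        com.hadoop.compression.lzo.DistributedLzoIndexer \
        /user/example/logs

Each indexed file gets a companion .lzo.index file that LzoTextInputFormat uses to compute splits.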

As for Hive, as long as you don't index the LZO files, the change is also transparent. If you start indexing (to take advantage of a better data distribution) you will need to update the input format to LzoTextInputFormat. If you use partitions, you can do it per partition.
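
If you use partitioned Hive tables, the per-partition switch could look like the following sketch (my_table and the partition spec are hypothetical, and the exact SET FILEFORMAT syntax varies slightly between Hive versions):

    # switch a single partition to the splittable LZO input format
    hive -e "
      ALTER TABLE my_table PARTITION (dt='2013-01-01')
      SET FILEFORMAT
      INPUTFORMAT 'com.hadoop.compression.lzo.LzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
    "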
