How to enable LZO compression in Hadoop MapReduce?

LZO's licence (GPL) is incompatible with that of Hadoop (Apache) and therefore it cannot be bundled with it. One needs to install LZO separately on the cluster.

The following steps were tested on Cloudera's Demo VM (CentOS 6.2, x64), which comes with the full CDH 4.2.0 stack and CM Free Edition installed, but they should work on any Red Hat-based Linux distribution.

The installation consists of the following steps:

  • Installing LZO

    sudo yum install lzop

    sudo yum install lzo-devel

  • Installing ANT

    sudo yum install ant ant-nodeps ant-junit java-devel

  • Downloading the source

    git clone https://github.com/twitter/hadoop-lzo.git

  • Compiling Hadoop-LZO

    cd hadoop-lzo

    ant compile-native tar

    For further instructions and troubleshooting see https://github.com/twitter/hadoop-lzo

  • Copying the Hadoop-LZO jar to Hadoop libs

    sudo cp build/hadoop-lzo*.jar /usr/lib/hadoop/lib/

  • Moving native code to Hadoop native libs

    sudo mv build/hadoop-lzo-0.4.17-SNAPSHOT/lib/native/Linux-amd64-64/ /usr/lib/hadoop/lib/native/

    sudo cp /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression.* /usr/lib/hadoop/lib/native/

    Adjust the version number (0.4.17-SNAPSHOT here) to match the version you cloned.

  • When working with a real cluster (as opposed to a pseudo-cluster) you need to rsync these to the rest of the machines

    rsync -av /usr/lib/hadoop/lib/ somehost:/usr/lib/hadoop/lib/

    Run this for every other host (somehost is a placeholder for your node names). You can do a dry run first by adding the -n flag.

  • Login to Cloudera Manager

  • Select from Services: mapreduce1->Configuration

  • Client->Compression

  • Add to Compression Codecs:

    com.hadoop.compression.lzo.LzoCodec

    com.hadoop.compression.lzo.LzopCodec
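
    If you are not using Cloudera Manager, the equivalent raw setting is the io.compression.codecs property in core-site.xml. A sketch, assuming you keep the stock codecs and append the LZO ones:

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec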

  • Search "valve"

  • Add to MapReduce Service Configuration Safety Valve

    io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec

    mapred.child.env="JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64/"

  • Add to MapReduce Service Environment Safety Valve

    HADOOP_CLASSPATH=/usr/lib/hadoop/lib/*

That's it.
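
To sanity-check the setup, you can compress a small file with lzop, push it to HDFS and let Hadoop decode it through the newly registered codec (file names and paths are arbitrary examples):

    # create and compress a test file; lzop keeps the original and writes sample.txt.lzo
    echo "hello lzo" > /tmp/sample.txt
    lzop /tmp/sample.txt

    # upload it and read it back; -text resolves the codec from the .lzo extension
    hadoop fs -put /tmp/sample.txt.lzo /tmp/
    hadoop fs -text /tmp/sample.txt.lzo

If the last command prints "hello lzo", the codec and the native library are wired up correctly.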

Your MapReduce jobs that use TextInputFormat will work seamlessly with .lzo files. However, if you choose to index the LZO files to make them splittable (using com.hadoop.compression.lzo.DistributedLzoIndexer), you will find that the indexer writes a .index file next to each .lzo file. This is a problem because TextInputFormat will interpret these as part of the input. In that case you need to change your MR jobs to work with LzoTextInputFormat.
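
For reference, the indexer is invoked roughly like this (a sketch: the jar version should match what you built above, and /user/example/logs is a placeholder for your input path):

    # launches an MR job that indexes every .lzo file under the given path
    hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.17-SNAPSHOT.jar \
        com.hadoop.compression.lzo.DistributedLzoIndexer \
        /user/example/logs

Each indexed file gets a companion .lzo.index file that LzoTextInputFormat uses to compute splits.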

As for Hive, as long as you don't index the LZO files, the change is also transparent. If you start indexing (to take advantage of a better data distribution) you will need to update the input format to LzoTextInputFormat. If you use partitions, you can do it per partition.
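
If you use partitioned Hive tables, the per-partition switch could look like the following sketch (my_table and the partition spec are hypothetical, and the exact SET FILEFORMAT syntax varies slightly between Hive versions):

    # switch a single partition to the splittable LZO input format
    hive -e "
      ALTER TABLE my_table PARTITION (dt='2013-01-01')
      SET FILEFORMAT
      INPUTFORMAT 'com.hadoop.compression.lzo.LzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
    "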
