How to use CompressionCodec in Hadoop

Submitted by 橙三吉。 on 2019-12-07 19:18:16

Question


I am doing the following to compress the output files from my reducer:

OutputStream out = ipFs.create( new Path( opDir + "/" + fileName ) );
CompressionCodec codec = new GzipCodec(); 
OutputStream cs = codec.createOutputStream( out );
BufferedWriter cout = new BufferedWriter( new OutputStreamWriter( cs ) );
cout.write( ... );

But I got a NullPointerException at line 3 (the createOutputStream call):

java.lang.NullPointerException
    at org.apache.hadoop.io.compress.zlib.ZlibFactory.isNativeZlibLoaded(ZlibFactory.java:63)
    at org.apache.hadoop.io.compress.GzipCodec.createOutputStream(GzipCodec.java:92)
    at myFile$myReduce.reduce(myFile.java:354)

I also found the following JIRA for the same issue.

Can you please tell me if I am doing something wrong?


Answer 1:


You should use the CompressionCodecFactory if you want to use compression outside of the standard OutputFormat handling (as detailed in @linker's answer):

CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
CompressionCodec codec = ccf.getCodecByClassName(GzipCodec.class.getName());
OutputStream compressedOutputStream = codec.createOutputStream(outputStream);
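For completeness, here is a minimal end-to-end sketch of that approach; the file path and the written text are placeholders I've made up, not part of the original question:

import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;

public class GzipWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The factory hands back a codec that is already configured,
        // so the native-zlib check inside createOutputStream has a conf to read.
        CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
        CompressionCodec codec = ccf.getCodecByClassName(GzipCodec.class.getName());

        OutputStream out = fs.create(new Path("/tmp/example.gz")); // placeholder path
        BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(codec.createOutputStream(out)));
        try {
            writer.write("some reducer output"); // placeholder content
        } finally {
            writer.close(); // finishes the gzip trailer and closes the HDFS stream
        }
    }
}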



Answer 2:


You're doing it wrong. The standard way to do this would be:

TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
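In context, that call goes in the driver's job setup. A rough sketch, assuming a Hadoop 2.x-style mapreduce API driver (the job name and output path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gzip-output-example"); // placeholder name

        job.setOutputFormatClass(TextOutputFormat.class);

        // Tell the output format to compress, and with which codec;
        // the framework then wraps the reducer's output stream for you.
        FileOutputFormat.setCompressOutput(job, true);
        TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        FileOutputFormat.setOutputPath(job, new Path("/tmp/output")); // placeholder path
        // ... set mapper, reducer, input path, and key/value classes as usual ...
    }
}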

GzipCodec is a Configurable; if you instantiate it directly, you have to initialize it properly yourself (call setConf(), etc.).
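If you really do want to instantiate the codec yourself, a minimal sketch of that initialization, assuming you have a Configuration at hand:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class DirectCodecInit {
    // Option 1: ReflectionUtils calls setConf on Configurable instances for you,
    // which is exactly the step the original code skipped.
    public static CompressionCodec newGzipCodec(Configuration conf) {
        return ReflectionUtils.newInstance(GzipCodec.class, conf);
    }

    // Option 2: instantiate directly and configure by hand.
    public static CompressionCodec newGzipCodecByHand(Configuration conf) {
        GzipCodec codec = new GzipCodec();
        codec.setConf(conf); // without this, createOutputStream NPEs as in the question
        return codec;
    }
}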

Try this and let me know if that works.



Source: https://stackoverflow.com/questions/10155602/how-to-use-compressioncodec-in-hadoop
