I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, a
LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.
I just tried this and it works:
    hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
        -inputformat SequenceFileAsTextInputFormat \
        -output test_output \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper
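
Once the identity pass works, the mapper can be swapped for any executable that reads records on stdin. Here is a minimal sketch of one, not something I've run against the full dataset. It assumes that with SequenceFileAsTextInputFormat each record arrives as the key, a tab, then the value, and that the value holds the tab-separated fields ngram, year, occurrences, pages, books. Check that layout against your own test_output before relying on it. The function name `ngram_mapper` and the sample record are made up for illustration.

```shell
#!/bin/sh
# Hypothetical streaming mapper: emits "ngram<TAB>occurrences" per record.
# Assumed input layout (verify against your own test_output):
#   key <TAB> ngram <TAB> year <TAB> occurrences <TAB> pages <TAB> books
ngram_mapper() {
  # Skip any line that has fewer than 4 tab-separated fields.
  awk -F '\t' 'NF >= 4 { print $2 "\t" $4 }'
}

# Local check with one made-up record before running on the cluster.
printf '1\texample\t2000\t5\t4\t3\n' | ngram_mapper
```

To try it on the cluster, the awk line could go in a small script that is shipped with the streaming `-file` option and named in `-mapper` in place of the IdentityMapper class.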