Question
Currently, when I STORE into HDFS, it creates many part files.
Is there any way to store out to a single CSV file?
Answer 1:
You can do this in a few ways:
To set the number of reducers for all Pig operations, you can use the default_parallel property - but this means every single step will use a single reducer, decreasing throughput:

set default_parallel 1;

Prior to calling STORE, if one of the operations executed is COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then you can use the PARALLEL 1 keyword to denote the use of a single reducer to complete that command:

GROUP a BY grp PARALLEL 1;
See the Pig Cookbook - Parallel Features section for more information.
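As a minimal sketch of how the two options fit into a script (the relation names and HDFS paths here are hypothetical, not taken from the original question):

set default_parallel 1;                    -- option 1: one reducer for every job in the script

data    = LOAD '/user/hadoop/input' USING PigStorage(',') AS (grp:chararray, val:int);
grouped = GROUP data BY grp PARALLEL 1;    -- option 2: one reducer for this step only
counts  = FOREACH grouped GENERATE group, COUNT(data) AS cnt;

-- with a single reducer, the output directory contains just one part file
STORE counts INTO '/user/hadoop/output' USING PigStorage(',');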
Answer 2:
You can also use Hadoop's getmerge command to merge all those part-* files. This is only possible if you run your Pig scripts from the Pig shell (and not from Java).
This has an advantage over the proposed solution: you can still use several reducers to process your data, so your job may run faster, especially if each reducer outputs only a little data.
grunt> fs -getmerge <Pig output file> <local file>
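For example, assuming the job wrote to a hypothetical /user/hadoop/output directory, the merge would look like this:

grunt> fs -getmerge /user/hadoop/output /tmp/result.csv

This concatenates all the part-* files in the HDFS directory into a single local file. The same -getmerge option is also available outside the Pig shell via the regular hadoop fs command line.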
Source: https://stackoverflow.com/questions/9910908/store-output-to-a-single-csv