With Apache Pig how to select and store columns from a CSV according to header line

Question

I have many CSV files, all with a header line. The files all look similar:

name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
.
.
.

But column positions and exact header names can differ slightly, e.g. another file could look like:

"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
.
.
.

I want to output/store only the columns for which the header contains the word "name". I do not know the position of this column in advance, because each file can be different.

So, I need to associate the columns in each file with their header names. Can I do this in Pig?

I thought of using two FILTER operators (one for the header, one for the data), but wouldn't the data then have to be read twice?

Answer 1:

It would probably be easier to do this in streaming or in a storage function.

See the implementation of CSVExcelStorage and SKIP_INPUT_HEADER - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java

You could read the header of the file, find the location of the "name" field and then only return the field in that location for all the other records in the file.

You should make sure that each split is a single file. If a file is split between tasks, the tasks that process the parts without the header line cannot detect the position of the "name" field.
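As a rough illustration of the streaming route, here is a minimal Python sketch of the header lookup described above. It is not taken from CSVExcelStorage; the function name `select_columns` and the `keyword` parameter are made up for this example, and it assumes the whole file, header included, reaches a single process as plain text lines.

```python
import csv


def select_columns(lines, keyword="name"):
    """Yield each data row reduced to the columns whose header contains `keyword`.

    `lines` is any iterable of text lines; the first line is treated as the header.
    (Hypothetical helper for illustration, not part of Pig or piggybank.)
    """
    # skipinitialspace lets headers like `"the name", "the gender"` parse as quoted fields.
    reader = csv.reader(lines, skipinitialspace=True)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to emit

    # Positions of all header cells that contain the keyword (case-insensitive).
    wanted = [i for i, cell in enumerate(header) if keyword.lower() in cell.lower()]

    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Keep only the wanted positions; guard against rows shorter than the header.
        yield [row[i].strip() for i in wanted if i < len(row)]


def main(stdin, stdout):
    """Read CSV lines from `stdin` and write the selected columns to `stdout`."""
    writer = csv.writer(stdout, lineterminator="\n")
    for selected in select_columns(stdin):
        writer.writerow(selected)
```

A script that ends with `import sys; main(sys.stdin, sys.stdout)` could then be plugged into Pig's STREAM ... THROUGH operator. The same caveat applies: a process that never sees the header line would treat its first data row as the header, so each file has to stay in one split.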

Source: https://stackoverflow.com/questions/18053048/with-apache-pig-how-to-select-and-store-columns-from-a-csv-according-to-header-l

Tags

performance

header

apache-pig