With Apache Pig how to select and store columns from a CSV according to header line

Submitted by 让人想犯罪 __ on 2020-01-15 09:14:11

Question


I have many CSV files, all with a header line. The files all look similar:

name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
.
.
.

But the column positions and the exact header names can differ slightly, e.g. another file could look like:

"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
.
.
.

I want to output/store only the columns whose header contains the word "name". I do not know the position of this column in advance, because it can be different in each file.

So, I need to associate the columns in each file with their header names. Can I do this in Pig?

I thought of using two FILTER operators (one for the header, one for the data), but wouldn't the data then have to be read twice?


Answer 1:


It would probably be easier to do this in streaming or in a storage function.

See the implementation of CSVExcelStorage and SKIP_INPUT_HEADER - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java

You could read the header of the file, find the location of the "name" field and then only return the field in that location for all the other records in the file.
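The approach described above can be sketched in Python, for instance as the core of a streaming script. This is only a minimal illustration, not the CSVExcelStorage implementation: the function name `select_columns` and its `keyword` parameter are hypothetical, and the sketch assumes that the header is the first line seen, that values contain no embedded newlines, and that the header line itself should not be emitted.

```python
import csv
import io


def select_columns(lines, keyword="name"):
    """Yield each data row reduced to the columns whose header contains `keyword`.

    `lines` is any iterable of CSV lines whose first line is the header.
    (Hypothetical helper for illustration; not part of Pig or piggybank.)
    """
    # skipinitialspace lets the parser handle the `, "the name"` style seen in the question
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to emit
    # Positions of every column whose header mentions the keyword (case-insensitive)
    wanted = [i for i, h in enumerate(header) if keyword in h.lower()]
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Guard against short rows that lack some of the wanted columns
        yield [row[i].strip() for i in wanted if i < len(row)]


# Demo with the second sample file from the question
sample = io.StringIO(
    '"the preferences", "the name", "the gender"\n'
    "soap, peter, m\n"
    "gel, paul, m\n"
    "soap, mary, f\n"
)
for out_row in select_columns(sample):
    print(",".join(out_row))
# prints:
# peter
# paul
# mary
```

To use this as a streaming script, one would pass `sys.stdin` instead of the in-memory sample, assuming each record reaches the script as one raw CSV line. The header positions are computed once per input, so the data is read only a single time.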

You should make sure that each split is a single file. If a file is split between tasks, the tasks that work on the parts of the file that don't contain the header won't be able to detect the "name" field.



Source: https://stackoverflow.com/questions/18053048/with-apache-pig-how-to-select-and-store-columns-from-a-csv-according-to-header-l
