Loading files into pig and decompressing them

Submitted by 大兔子大兔子 on 2019-12-13 04:54:09

Question


I am loading a bunch of files from Azure storage into Pig. Pig supports gzip by default, so if the file extension is .gz everything works fine.

The problem is that older files are stored with a .zip extension (I have millions of those).

Is there a way to tell Pig to load these files and treat .zip as gzip?


Answer 1:


I don't know whether other options are available, but you can try something like this:

  1. Write a bash script that converts the given .zip file to a .gz file.
  2. Load the .gz file in Pig.

This is just a sample for one file; you may need to adapt the script to your needs.

input.zip (an archive containing one text file with these lines)
1,john
2,cena
3,rock
4,sam

test.sh
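#!/bin/bash
# Strip the .zip extension to get the base name (input.zip -> input)
FILE_NAME="${1%.zip}"
# Stream the archive contents to stdout and recompress them as plain gzip.
# (tar czf would wrap the data in a tar archive, and Pig would read the tar headers as data.)
unzip -p "$1" | gzip > "$FILE_NAME.gz"
# Run the Pig script on the converted file
pig -x local -param PIG_INPUT_FILE="$FILE_NAME.gz" -f myscript.pig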

myscript.pig
A = LOAD '$PIG_INPUT_FILE' USING PigStorage(',');
DUMP A;

Output:

$ ./test.sh input.zip

(1,john)
(2,cena)
(3,rock)
(4,sam)
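
Since the question mentions millions of .zip files, the same idea could be applied to a whole directory at once. The following is only a minimal sketch under several assumptions: the archives have been copied to a locally accessible directory, each archive holds a single text file, and the `unzip` and `gzip` tools are installed. The function name `convert_zips` and its two directory arguments are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# convert_zips SRC_DIR DEST_DIR   (hypothetical helper, not part of Pig)
# Recompress every .zip file in SRC_DIR as a plain .gz file in DEST_DIR.
convert_zips() {
    local src_dir="$1" dest_dir="$2" zip_file base
    mkdir -p "$dest_dir" || return 1
    for zip_file in "$src_dir"/*.zip; do
        # When nothing matches, the glob stays literal, so skip missing paths
        [ -e "$zip_file" ] || continue
        base="$(basename "$zip_file" .zip)"
        # unzip -p streams the archive contents to stdout; gzip recompresses them.
        # The subshell enables pipefail so a failing unzip is also detected.
        if ! ( set -o pipefail; unzip -p "$zip_file" | gzip > "$dest_dir/$base.gz" ); then
            echo "failed to convert $zip_file" >&2
            rm -f "$dest_dir/$base.gz"
        fi
    done
}
```

For example, `convert_zips ./zips ./gz` would write one .gz file per archive into ./gz, and the resulting directory could then be passed to the Pig script as PIG_INPUT_FILE. Archives that fail to convert are reported on stderr and skipped rather than stopping the whole run.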

Another possible option is to write a Java UDF, implemented as a custom LoadFunc, that reads the .zip files using the java.util.zip library. I haven't tried this option, but you can give it a try if you want.



Source: https://stackoverflow.com/questions/26239338/loading-files-into-pig-and-decompressing-them
