Question:
I am loading a bunch of files from Azure storage into Pig. Pig has default support for gzip, so if the file extension is .gz everything works fine.
Problem is that older files are stored with .zip extension (I have millions of those).
Is there a way to tell pig to load files and treat .zip as gzip?
Answer 1:
I'm not aware of a built-in option, but you can try something like this:
- write a bash script that converts the given zip file to a gz file
- load the gz file in Pig
Here is a sample for a single file; you may need to adapt the script to your needs.
input.zip
1,john
2,cena
3,rock
4,sam
test.sh
#!/bin/bash
# Strip the .zip extension to get the name of the file inside the archive
FILE_NAME=$(echo "$1" | cut -d '.' -f1)
unzip "$1"
# Compress with gzip (not tar) so Pig's built-in gzip support can read it;
# this produces "$FILE_NAME.gz" and removes the uncompressed file
gzip "$FILE_NAME"
pig -x local -param PIG_INPUT_FILE="$FILE_NAME.gz" -f myscript.pig
myscript.pig
A = LOAD '$PIG_INPUT_FILE' USING PigStorage(',');
DUMP A;
Output:
$ ./test.sh input.zip
(1,john)
(2,cena)
(3,rock)
(4,sam)
Another possible option is to write a UDF that converts zip to gz using the java.util.zip library, and call it from a LoadFunc. I haven't tried this option, but you could give it a try.
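The java.util.zip route can also be used without a full LoadFunc: a small standalone program that re-compresses each .zip as plain gzip, after which Pig reads the .gz files natively. This is a minimal sketch under the assumption that each archive holds a single data file, as in the example above; the class and method names are my own, not from the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipToGzip {

    // Re-compress the first entry of a .zip archive as a .gz file.
    // Assumes each archive contains exactly one data file.
    static void convert(Path zipPath, Path gzPath) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.entries().nextElement();
            try (InputStream in = zip.getInputStream(entry);
                 OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzPath))) {
                in.transferTo(out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Convert every .zip in the given directory (default: current dir)
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(dir, "*.zip")) {
            for (Path zip : zips) {
                String name = zip.getFileName().toString();
                Path gz = zip.resolveSibling(
                        name.substring(0, name.length() - 4) + ".gz");
                convert(zip, gz);
            }
        }
    }
}
```

Since you have millions of files, a batch approach like this (or the same logic parallelized) avoids invoking unzip/gzip once per file from a shell loop.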
Source: https://stackoverflow.com/questions/26239338/loading-files-into-pig-and-decompressing-them