Why are my `binaryFiles` empty when I collect them in pyspark?

Question

I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.

I pass that path to `binaryFiles` in PySpark:

zips = sc.binaryFiles('/user/path-to-folder-with-zips/')

I'm trying to unzip the zip files and process the text files inside them, so as a first step I collected the RDD to see what it contains:

zips_collected = zips.collect()

But, when I do that, it gives an empty list:

>>> zips_collected
[]

I know that the zips are not empty; they contain text files. The documentation says:

Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But, I should at least be able to see SOMETHING. Why does it not return anything?

There can be more than one file per zip file, but the contents are always something like this:

rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data

Answer 1:

I'm assuming that each zip file contains a single text file (the code is easily changed for multiple text files). You need to read the contents of the zip file via io.BytesIO before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233.

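import io
import zipfile

def zip_extract(x):
    """Extract the first file of a *.zip archive in memory for Spark"""
    # x is a (path, content) pair from binaryFiles; x[1] holds the raw zip bytes.
    # zipfile is used here because gzip.GzipFile cannot read zip archives.
    with zipfile.ZipFile(io.BytesIO(x[1]), "r") as archive:
        first_member = archive.namelist()[0]
        # Assumption: the text file is UTF-8 encoded
        return archive.read(first_member).decode("utf-8")

def parse_line(line):
    # Placeholder parser (an assumption): split a pipe-delimited row into fields
    return line.split("|")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.map(zip_extract) \
                  .flatMap(lambda zip_file: zip_file.splitlines()) \
                  .map(parse_line) \
                  .collect()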
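
Since the question mentions that a zip can hold more than one file, here is a minimal sketch of how the extraction step might be adapted using the standard-library zipfile module. The function name zip_extract_all and the UTF-8 default are assumptions for illustration, not part of the original answer.

```python
import io
import zipfile

def zip_extract_all(record, encoding="utf-8"):
    """Return every line of every file inside one zipped binaryFiles record.

    record is a (path, content) pair; content holds the raw bytes of the zip.
    The encoding default is an assumption; adjust it to match the data.
    """
    _path, content = record
    lines = []
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries hold no data
            text = archive.read(info).decode(encoding)
            lines.extend(text.splitlines())
    return lines

# With a SparkContext `sc`, one flatMap would then yield all lines:
# lines = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip').flatMap(zip_extract_all)
```

Because the function returns a list of lines, it can be passed straight to flatMap, so a separate map and split step is not needed. Each archive is still decompressed entirely in memory on one executor, so this only suits zips that fit comfortably in memory.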

Source: https://stackoverflow.com/questions/38256631/why-are-my-binaryfiles-empty-when-i-collect-them-in-pyspark

Tags

python

Hadoop

zip

pyspark

binaryfiles