Why are my `binaryFiles` empty when I collect them in pyspark?

Posted by 魔方 西西 on 2019-12-06 12:35:34

Question


I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.

I pass that path to `binaryFiles` in PySpark:

zips = sc.binaryFiles('/user/path-to-folder-with-zips/')

I want to unzip the files and process the text files inside them, so as a first step I tried to look at what the RDD contains:

zips_collected = zips.collect()

But that returns an empty list:

>>> zips_collected
[]

I know the zips are not empty: they contain text files. The documentation says:

Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

What am I doing wrong here? I know I can't read the contents directly because the files are zipped and therefore binary, but I should at least be able to see something. Why does it not return anything?

There can be more than one file per zip file, but the contents are always something like this:

rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
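
As an illustration of the record shape the documentation describes, the sketch below mimics one (path, content) pair locally, without Spark. The helper names `make_zip_record` and `describe_record`, the path, and the sample rows are all hypothetical and only serve the example.

```python
import io
import zipfile

def make_zip_record(path, members):
    """Build a (path, bytes) pair shaped like one binaryFiles record (local stand-in)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return path, buffer.getvalue()

def describe_record(record):
    """Summarize a record without decompressing its contents."""
    path, content = record
    return {
        "path": path,
        "size": len(content),
        # is_zipfile accepts a file-like object, so the bytes are wrapped in BytesIO
        "is_zip": zipfile.is_zipfile(io.BytesIO(content)),
    }

# Hypothetical path and rows, following the pipe-delimited layout above
record = make_zip_record(
    "/hypothetical/sample.zip",
    {"part1.txt": "1|a|b|c|d|e\n2|f|g|h|i|j\n"},
)
summary = describe_record(record)
```

If the RDD were non-empty, something like `zips.map(describe_record).collect()` should return one small summary per file instead of pulling the raw bytes to the driver.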

Answer 1:


I'm assuming that each zip file contains a single text file (the code is easily changed for multiple text files). You need to read the contents of the zip file first via io.BytesIO before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233.

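import io
import zipfile

def zip_extract(x):
    """Extract the first file of a *.zip archive in memory for Spark"""
    # x is a (path, bytes) pair from binaryFiles; gzip cannot read zip archives
    with zipfile.ZipFile(io.BytesIO(x[1]), "r") as archive:
        first_member = archive.namelist()[0]
        # Assumes UTF-8 text; decode so the lines can be split as str
        return archive.read(first_member).decode("utf-8")

def parse_line(line):
    # Placeholder parser (an assumption): split the pipe-delimited row into fields
    return line.split("|")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.map(zip_extract) \
                  .flatMap(lambda zip_file: zip_file.splitlines()) \
                  .map(parse_line) \
                  .collect()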
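
The answer assumes one text file per archive, while the question says there can be several. A minimal sketch of a multi-file variant follows, assuming the members are UTF-8 text; the name `zip_extract_all` and the sample archive are hypothetical, and the check at the bottom runs locally without Spark.

```python
import io
import zipfile

def zip_extract_all(record):
    """Return every line from every member of one zipped (path, bytes) record."""
    _path, content = record
    lines = []
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries, which carry no data
            # Assumption: members are UTF-8 encoded text
            text = archive.read(name).decode("utf-8")
            lines.extend(text.splitlines())
    return lines

# Local check with an in-memory archive holding two hypothetical members
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "1|x|y\n2|x|y\n")
    archive.writestr("b.txt", "3|x|y\n")
sample_lines = zip_extract_all(("/hypothetical/sample.zip", buffer.getvalue()))
```

Because the function returns a list of lines, it would plug into `flatMap` directly, for example `sc.binaryFiles('/user/path-to-folder-with-zips/*.zip').flatMap(zip_extract_all)`, so no separate split step is needed.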


Source: https://stackoverflow.com/questions/38256631/why-are-my-binaryfiles-empty-when-i-collect-them-in-pyspark
