Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-compressed input files here, and I wonder whether the same support exists for other kinds of compressed files, such as

5 Answers

忘掉有多难 (OP)

2020-12-03 15:44

You can use sc.binaryFiles to read the zip file as raw bytes, then unzip it in memory to get the text. Unfortunately, zip files are not splittable, so each archive reaches a single task as one whole record and has to be decompressed there. Afterwards you may want to repartition the RDD (which triggers a shuffle) to balance the data across partitions.

Here is an example in Python. More information is available at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/

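 import io
 import zipfile

 # sc, HDFS_path and data_path are assumed to be defined already
 file_RDD = sc.binaryFiles(HDFS_path + data_path)

 def Zip_open(binary_stream_string):
     # Treat the raw bytes of one file as an in-memory zip archive
     try:
         pseudo_file = io.BytesIO(binary_stream_string)
         return zipfile.ZipFile(pseudo_file)
     except zipfile.BadZipFile:
         # Not a valid zip archive
         return None

 def read_zip_lines(zipfile_object):
     # Return all lines (as bytes) of the archive member 'diff.txt'
     with zipfile_object.open('diff.txt') as member:
         return member.readlines()

 # Values are ZipFile objects, or None for files that could not be opened
 My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))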
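
The steps described in this answer (read whole archives with sc.binaryFiles, unzip them in memory, then rebalance) could be combined roughly as in the sketch below. The names iter_zip_lines and load_zipped_lines, the UTF-8 default, and the choice to read every member of each archive rather than only 'diff.txt' are assumptions made for illustration and are not part of the original answer. The Spark-specific part lives in a function that takes an existing SparkContext, so the unzip logic can be tried without a cluster.

```python
import io
import zipfile


def iter_zip_lines(path, content, encoding="utf-8"):
    """Yield (path, member_name, line) for each text line in an in-memory zip.

    `path` and `content` correspond to one (key, value) pair produced by
    sc.binaryFiles. The encoding is an assumption; adjust it to your data.
    """
    try:
        archive = zipfile.ZipFile(io.BytesIO(content))
    except zipfile.BadZipFile:
        return  # silently skip inputs that are not valid zip archives
    with archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            with archive.open(info) as member:
                for raw in member:
                    yield path, info.filename, raw.decode(encoding).rstrip("\r\n")


def load_zipped_lines(sc, glob_path, num_partitions):
    """Hypothetical helper: build an RDD of lines from zip files.

    `sc` is an existing SparkContext. Each archive is unzipped by one task,
    so the result is repartitioned afterwards to spread the lines out.
    """
    files = sc.binaryFiles(glob_path)
    lines = files.flatMap(lambda kv: iter_zip_lines(kv[0], kv[1]))
    return lines.repartition(num_partitions)
```

Using flatMap instead of map means the RDD holds plain tuples of strings rather than ZipFile objects, which keeps the records easy to serialize when they are shuffled by repartition. Note that every archive still has to fit in the memory of the task that unzips it.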
