Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for other kinds of compressed files, such as

5 answers
  •  忘掉有多难
    2020-12-03 15:44

    You can use sc.binaryFiles to read the zip file as raw bytes and then unzip it into text. Unfortunately, a zip file is not splittable, so each archive has to be decompressed as a whole in a single task. Afterwards you may want to repartition (which triggers a shuffle) to balance the data across partitions.

    Here is an example in Python. More info is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/

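     import io
     import zipfile

     file_RDD = sc.binaryFiles( HDFS_path + data_path )  # RDD of (path, raw bytes) pairs

     def Zip_open( binary_stream_string ) : # New version, treat a stream as zipped file
         try :
             pseudo_file = io.BytesIO( binary_stream_string )
             zf = zipfile.ZipFile( pseudo_file )
             return zf
         except zipfile.BadZipFile : # not a valid zip archive
             return None

     def read_zip_lines(zipfile_object) :
         if zipfile_object is None : # Zip_open failed
             return []
         with zipfile_object.open('diff.txt') as file_iter : # member name inside the archive
             data = file_iter.readlines() # list of bytes, one per line
         return data

     # Open and read inside one map, so no ZipFile object has to be serialized
     My_RDD = file_RDD.map(lambda kv: (kv[0], read_zip_lines(Zip_open(kv[1]))))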
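
    If you want to check the unzip step on its own before running it on a cluster, something like the following might help. It is only a sketch: zip_bytes_to_lines and make_sample_zip are hypothetical helper names, UTF-8 text is assumed, and it reads every member of the archive rather than only 'diff.txt'. The Spark call at the bottom is left as a comment, and num_partitions there is an assumed value you would pick yourself.

```python
import io
import zipfile


def zip_bytes_to_lines(zip_bytes, encoding="utf-8"):
    """Return every text line of every member in an in-memory zip archive.

    Hypothetical helper: returns an empty list if the bytes are not a valid zip.
    """
    try:
        archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile:
        return []
    lines = []
    with archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = archive.read(name).decode(encoding)  # assumes UTF-8 text
            lines.extend(text.splitlines())
    return lines


def make_sample_zip():
    """Build a small zip archive in memory so the helper can be tried locally."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("diff.txt", "first line\nsecond line\n")
    return buffer.getvalue()


sample_lines = zip_bytes_to_lines(make_sample_zip())

# With a SparkContext `sc`, the same helper could be applied per file, e.g.:
#   lines_RDD = (sc.binaryFiles(HDFS_path + data_path)
#                  .flatMap(lambda kv: zip_bytes_to_lines(kv[1]))
#                  .repartition(num_partitions))  # num_partitions: assumed value
```

    Because the helper returns plain strings, the resulting RDD holds only picklable data, so a later repartition can shuffle it without problems.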
    
