Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

septra

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*").
               flatMapValues(x => extractFiles(x).toOption).
               mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
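
Here extractFiles and decode are the helpers defined in the linked answer. A minimal sketch of them, assuming Apache Commons Compress is on the classpath (the chunk size n and the UTF-8 default are illustrative):

import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try
import org.apache.spark.input.PortableDataStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream

// Unpack every regular file in a gzipped tar stream into its own Array[Byte].
def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry)).
    takeWhile(_.isDefined).flatten.   // getNextTarEntry returns null at end of archive
    filter(!_.isDirectory).           // skip directory entries
    map { _ =>
      Stream.continually {
        val buffer = Array.fill[Byte](n)(-1)
        val read = tar.read(buffer, 0, n)   // read the current entry in n-byte chunks
        (read, buffer.take(read))
      }.takeWhile(_._1 > 0).flatMap(_._2).toArray
    }.toArray
}

// Turn raw bytes into a String; defaults to UTF-8.
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)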

This approach works fine for relatively small tar archives, but it is not suitable for large ones: a gzipped tar file is not splittable, so each archive is read and unpacked by a single task and must fit in that task's memory.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (unlike tar archives).

See: stuartsierra.com/2008/04/24/a-million-little-files
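
A one-off conversion along those lines might look like the following sketch, reusing the extractFiles/decode helpers above (the output path is illustrative):

// One-off job: unpack each archive once and persist the contents as a
// splittable SequenceFile keyed by the source archive path.
val filePairs = sc.binaryFiles("gzarchive/*").
                flatMapValues(x => extractFiles(x).toOption).
                flatMapValues(_.map(decode()))

filePairs.saveAsSequenceFile("gzarchive-seq")

// Subsequent jobs can then read the SequenceFile in parallel, block by block.
val df = sqlContext.read.json(sc.sequenceFile[String, String]("gzarchive-seq").values)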

Files inside a *.tar.gz archive are, as you already mentioned, compressed. You cannot put the 3 files into a single compressed tar archive and expect the import function (which looks only for plain text) to know how to decompress the archive, unpack the files from it, and then import each file individually.

I would recommend you take the time to upload each individual JSON file manually, since neither sc.textFile nor sqlContext.read.json can unpack a tar archive (they do decompress plain gzip files transparently, but cannot split a multi-file .tar.gz into its members).
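
If the files are stored individually gzipped rather than tarred together, Spark's text-based readers handle the decompression for you; a minimal sketch, with an illustrative path:

// Each file is its own .json.gz (no tar wrapper), so the Hadoop gzip codec
// decompresses it transparently during the read.
val df = sqlContext.read.json("jsondir/*.json.gz")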
