Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

septra

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*").
               flatMapValues(x => extractFiles(x).toOption).
               mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
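
Here extractFiles and decode are the helpers defined in the linked answer. A minimal sketch of them, assuming Apache Commons Compress is on the classpath (the chunk size n and the UTF-8 default are illustrative):

import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try
import org.apache.spark.input.PortableDataStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream

// Unpack every regular file in a gzipped tar stream into its own Array[Byte].
def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry)).
    takeWhile(_.isDefined).flatten.   // getNextTarEntry returns null at end of archive
    filter(!_.isDirectory).           // skip directory entries
    map { _ =>
      Stream.continually {
        val buffer = Array.fill[Byte](n)(-1)
        val read = tar.read(buffer, 0, n)   // read the current entry in n-byte chunks
        (read, buffer.take(read))
      }.takeWhile(_._1 > 0).flatMap(_._2).toArray
    }.toArray
}

// Turn raw bytes into a String; defaults to UTF-8.
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)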

This approach works fine for relatively small tar archives, but it is not suitable for large ones: a gzipped tar file is not splittable, so each archive is read and unpacked by a single task and must fit in that task's memory.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (unlike tar archives).

See: stuartsierra.com/2008/04/24/a-million-little-files
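
A one-off conversion along those lines might look like the following sketch, reusing the extractFiles/decode helpers above (the output path is illustrative):

// One-off job: unpack each archive once and persist the contents as a
// splittable SequenceFile keyed by the source archive path.
val filePairs = sc.binaryFiles("gzarchive/*").
                flatMapValues(x => extractFiles(x).toOption).
                flatMapValues(_.map(decode()))

filePairs.saveAsSequenceFile("gzarchive-seq")

// Subsequent jobs can then read the SequenceFile in parallel, block by block.
val df = sqlContext.read.json(sc.sequenceFile[String, String]("gzarchive-seq").values)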

Files inside a *.tar.gz archive are, as you already mentioned, compressed. You cannot put the 3 files into a single compressed tar archive and expect the import function (which looks only for plain text) to know how to decompress the archive, unpack the files from it, and then import each file individually.

I would recommend you take the time to upload each individual JSON file manually, since neither sc.textFile nor sqlContext.read.json can unpack a tar archive (they do decompress plain gzip files transparently, but cannot split a multi-file .tar.gz into its members).
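
If the files are stored individually gzipped rather than tarred together, Spark's text-based readers handle the decompression for you; a minimal sketch, with an illustrative path:

// Each file is its own .json.gz (no tar wrapper), so the Hadoop gzip codec
// decompresses it transparently during the read.
val df = sqlContext.read.json("jsondir/*.json.gz")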
