Is gzip format supported in Spark?

Asked 2020-11-29 03:05

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computations for repeated workloads. It can run on local files or on top of HDFS. Does Spark support reading gzip-compressed input files?

1 Answer
  • Answered 2020-11-29 03:29

    From the Spark Scala Programming guide's section on "Hadoop Datasets":

    Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

    Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).
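    A minimal sketch in Scala (the path myFile.gz and the local[*] master are placeholder assumptions for this example):

        import org.apache.spark.{SparkConf, SparkContext}

        object GzipExample {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("GzipExample").setMaster("local[*]")
            val sc = new SparkContext(conf)

            // textFile() delegates to Hadoop's TextInputFormat, which detects
            // the .gz extension and decompresses the stream transparently.
            val lines = sc.textFile("myFile.gz")
            println(s"Line count: ${lines.count()}")

            sc.stop()
          }
        }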

    As mentioned by @nick-chammas in the comments:

    Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.
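    A short sketch of that workaround (the target of 8 partitions is an arbitrary choice for illustration):

        val gzipped = sc.textFile("myFile.gz")
        println(gzipped.partitions.length)   // 1, because gzip input is not splittable

        // repartition() shuffles the data across the cluster so that
        // downstream transformations can run on all cores in parallel.
        val widened = gzipped.repartition(8)
        println(widened.partitions.length)   // 8

    The shuffle has a one-time cost, but it pays off when the RDD is reused across repeated operations, which is exactly the in-memory workload described in the question.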
