How to read a “.gz” compressed file using Spark DF or DS?

Question

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a tab-delimited CSV.

Answer 1:

A compressed CSV file is read the same way as an uncompressed one; Spark picks the gzip codec based on the .gz file extension. For Spark 2.0+, it can be done as follows in Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that a gzip file is not splittable, so Spark has to read the whole file in a single task on one core, which slows the read down. Once the data has been read, it can be repartitioned (which shuffles it) to increase parallelism.
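As a rough illustration of that last step, here is a minimal PySpark sketch. The helper name read_gz_tsv and the default of 8 partitions are placeholders chosen for this example, not anything from the question; a sensible partition count depends on your cluster and data size.

```python
def read_gz_tsv(spark, path, num_partitions=8):
    """Read a gzip-compressed, tab-delimited CSV and repartition it.

    `spark` is an existing SparkSession. `num_partitions` is an assumed,
    illustrative default, not a recommended value.
    """
    # The read itself is not parallel: a .gz file is not splittable,
    # so the whole file ends up in a single partition.
    df = spark.read.csv(path, sep="\t")

    # repartition() performs a full shuffle that spreads the rows over
    # num_partitions partitions, so later stages can use more cores.
    return df.repartition(num_partitions)


# Usage, assuming an active SparkSession named `spark`:
# df = read_gz_tsv(spark, "file.csv.gz", num_partitions=8)
# print(df.rdd.getNumPartitions())
```

The repartition only helps the work done after the read; the initial decompression is still done by one task. If the file is read repeatedly, it may be worth writing it back out once in a splittable format instead.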

Source: https://stackoverflow.com/questions/49490640/ho-to-read-gz-compressed-file-using-spark-df-or-ds

Tags

apache-spark

apache-spark-sql

spark-dataframe

gzip

apache-spark-dataset
