How to read a “.gz” compressed file using Spark DF or DS?

Happy的楠姐 2021-01-17 21:19

I have a compressed file with .gz format, Is it possible to read the file directly using spark DF/DS?

Details: the file is a tab-delimited CSV.

1 Answer
  • 2021-01-17 21:26

    Reading a compressed CSV is done the same way as reading an uncompressed CSV file: Spark detects the gzip codec from the `.gz` extension and decompresses transparently. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):

    val df = spark.read.option("sep", "\t").csv("file.csv.gz")
    

    PySpark:

    df = spark.read.csv("file.csv.gz", sep='\t')
    

    The only extra consideration to take into account is that the gz file is not splittable, therefore Spark needs to read the whole file using a single core which will slow things down. After the read is done the data can be shuffled to increase parallelism.
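    The point above can be sketched in PySpark. This is a minimal, self-contained example assuming a local PySpark installation; the sample file, its contents, and the partition count of 4 are illustrative, not from the original answer:

    ```python
    import gzip

    from pyspark.sql import SparkSession

    # Create a small tab-delimited, gzip-compressed sample file (illustrative data).
    with gzip.open("file.csv.gz", "wt") as f:
        f.write("1\ta\n2\tb\n3\tc\n")

    spark = SparkSession.builder.master("local[*]").appName("gz-read").getOrCreate()

    # gzip is not splittable, so the whole file lands in a single partition.
    df = spark.read.option("sep", "\t").csv("file.csv.gz")
    print(df.rdd.getNumPartitions())  # 1

    # Repartition after the read to regain parallelism for downstream stages.
    df2 = df.repartition(4)
    print(df2.rdd.getNumPartitions())  # 4

    spark.stop()
    ```

    The repartition triggers a shuffle, so it only pays off when the downstream work per row outweighs the shuffle cost.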
