I have a large (about 85 GB compressed) gzipped file from S3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances).
Spark cannot parallelize reading a single gzip file: gzip is not a splittable compression format, so one task has to decompress the entire 85 GB archive by itself. The best you can do is split the data into chunks that are each gzipped individually, so that every chunk can be read by a separate task.
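A minimal sketch of that re-splitting step, assuming the archive is line-oriented text and sits somewhere the script can read it directly (split_gzip and LINES_PER_CHUNK are illustrative names, not part of any library):

import gzip

LINES_PER_CHUNK = 5_000_000  # hypothetical; pick a size that yields chunks of a few hundred MB

def split_gzip(src_path, out_prefix):
    # Stream-decompress the big archive and rewrite it as many smaller,
    # individually gzipped files that Spark can read in parallel.
    chunk_idx, line_count = 0, 0
    out = gzip.open(f"{out_prefix}-{chunk_idx:05d}.gz", "wt")
    with gzip.open(src_path, "rt") as src:
        for line in src:
            out.write(line)
            line_count += 1
            if line_count >= LINES_PER_CHUNK:
                out.close()
                chunk_idx, line_count = chunk_idx + 1, 0
                out = gzip.open(f"{out_prefix}-{chunk_idx:05d}.gz", "wt")
    out.close()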
Even with the data split into many files, Spark's built-in gzip reader is really slow. You can speed it up by decompressing in Python instead:
import gzip

# Hand each worker a batch of chunk paths, then decompress them with Python's gzip module
# (the paths must be readable from the worker nodes, e.g. local or on a shared filesystem).
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda path: gzip.open(path, "rt").readlines())
Going through Python's gzip module this way is about twice as fast as Spark's native gzip reader.
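If the chunks live in S3 rather than on a filesystem the executors can see, the same trick works by fetching each object inside the flatMap. A sketch, assuming boto3 is installed on the worker nodes (the bucket name, key list, and read_gzip_lines helper are made up for illustration):

import gzip
import boto3

BUCKET = "my-bucket"  # hypothetical bucket holding the gzipped chunks

def read_gzip_lines(key):
    # Download one gzipped chunk from S3 and return its decompressed lines.
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return gzip.decompress(body).decode("utf-8").splitlines()

keys_rdd = sc.parallelize(list_of_keys, 100)
lines_rdd = keys_rdd.flatMap(read_gzip_lines)

From there lines_rdd behaves like any other RDD, so you can count it, filter it, or convert it to a DataFrame as usual.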