I have a large (about 85 GB compressed) gzipped file from S3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances).
I have faced this problem, and here is the solution.

The best way to approach this is to decompress the .gz file before the Spark batch run and then point Spark at the uncompressed file. A gzipped file is not splittable, so Spark has to read it with a single task, whereas an uncompressed file can be split and processed in parallel.

Code to decompress the .gz file:
import gzip
import shutil

# gzip.open reads and decompresses the .gz input; shutil.copyfileobj
# streams it to the plain output file without loading it all into memory
with gzip.open('file.txt.gz', 'rb') as f_in, open('file.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
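
After that, point the Spark job at the uncompressed file (for example after uploading it back to S3 or HDFS), and the read will be split across executors. A minimal sketch, assuming the decompressed file has been uploaded to a hypothetical s3://my-bucket/file.txt location:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("process-uncompressed-file").getOrCreate()

# Plain text is splittable, so this read is distributed across many tasks
# instead of the single task you get with a .gz file.
df = spark.read.text("s3://my-bucket/file.txt")
print(df.count())

The same approach works with an HDFS path on the cluster; the key point is that the input is no longer gzip-compressed when Spark reads it.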