s3 parquet write - too many partitions, slow writing

Submitted by 白昼怎懂夜的黑 on 2019-12-13 02:47:41

Question


I have a Scala Spark job that writes to S3 as Parquet. It holds 6 billion records so far and will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make those queries faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so the write creates 1,330,360 Parquet partition directories, and the write step is very slow: it has been running for the past 9 hours and is still going.

output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")

Is there any way I can reduce the number of partitions and still keep the read queries fast? Or is there a better way to handle this scenario? Thanks.

EDIT: id is an integer column with random numbers.


Answer 1:


You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest anything specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
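As a rough sketch of both suggestions, assuming the column name id and the path s3a://datalake/db/ from the question (the bucket count 1024, the derived column name id_bucket, and the table name datalake_db_by_id are illustrative assumptions, not part of the original post):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionSketch {

  // Option 1: partition by a coarse range of ids instead of the raw id.
  // With random integer ids, id % 1024 spreads rows roughly evenly across
  // 1024 directories, so a lookup for one id only scans one directory
  // instead of creating 1.3M tiny partitions at write time.
  def writeRangePartitioned(output: DataFrame): Unit = {
    output
      .withColumn("id_bucket", col("id") % 1024)
      .write
      .mode(SaveMode.Append)
      .partitionBy("id_bucket")
      .parquet("s3a://datalake/db/")
  }

  // Option 2: use Spark bucketing. bucketBy requires saveAsTable (it writes
  // to a table, not a bare path), so this assumes a Hive-compatible
  // metastore; option("path", ...) makes it an external table on S3.
  def writeBucketed(output: DataFrame): Unit = {
    output
      .write
      .mode(SaveMode.Append)
      .bucketBy(1024, "id")
      .sortBy("id")
      .option("path", "s3a://datalake/db/")
      .saveAsTable("datalake_db_by_id")
  }
}

With option 1, readers would need to filter on both columns, e.g. WHERE id_bucket = <id> % 1024 AND id = <id>, so that partition pruning kicks in before the id filter is applied.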



Source: https://stackoverflow.com/questions/50197782/s3-parquet-write-too-many-partitions-slow-writing
