Spark Small ORC Stripes

Submitted by 跟風遠走 on 2021-01-28 11:58:32

Question


We use Spark to flatten clickstream data and write it to S3 in ORC format with zlib compression. I have tried changing many Spark settings, but the stripes in the resulting ORC files are still very small (<2MB).

Things I have tried so far to increase the stripe size:

Earlier each file was 20MB. Using coalesce, I now create files of 250-300MB, but there are still 200 stripes per file, i.e. each stripe is <2MB.

I tried using HiveContext instead of SparkContext and setting hive.exec.orc.default.stripe.size to 67108864 (64MB), but Spark isn't honoring this setting.

So, any idea how I can increase the stripe size of the ORC files being created? The problem with small stripes is that when we query these ORC files with Presto and the stripe size is less than 8MB, Presto reads the whole data file instead of only the fields selected in the query.
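To make the numbers above concrete, here is a minimal sketch of the arithmetic, using only the figures from the question (250-300MB files, 200 stripes each, and the 8MB Presto threshold). The helper name `average_stripe_bytes` is made up for illustration, and the calculation ignores ORC footer and index overhead.

```python
MB = 1024 * 1024

# Threshold quoted in the question: below this, Presto reportedly reads the whole file.
PRESTO_SMALL_STRIPE_BYTES = 8 * MB


def average_stripe_bytes(file_bytes: int, stripe_count: int) -> float:
    """Average stripe size, ignoring the (small) ORC footer and index overhead."""
    if stripe_count <= 0:
        raise ValueError("stripe_count must be positive")
    return file_bytes / stripe_count


# Figures from the question: 250-300 MB files, 200 stripes each.
for file_mb in (250, 300):
    avg = average_stripe_bytes(file_mb * MB, 200)
    too_small = avg < PRESTO_SMALL_STRIPE_BYTES
    print(f"{file_mb} MB / 200 stripes = {avg / MB:.2f} MB per stripe, below 8 MB: {too_small}")
```

Both cases come out between 1.25MB and 1.5MB per stripe, well under the 8MB threshold, which is why making the files larger with coalesce did not help on its own.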

Presto Stripe issue related thread: https://groups.google.com/forum/#!topic/presto-users/7NcrFvGpPaA


Answer 1:


I posted the same question on the HDP Community platform and got the response below:

"It's related to HIVE-13232 (fixed in Hive 1.3.0, 2.0.1, 2.1.0), but all Apache Spark still uses Hive 1.2.1 library.

Could you try HDP 2.6.3+ (2.6.4 is the latest one). HDP Spark 2.2 has that fixed hive library."
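For reference, here is a rough sketch of how a stripe size could be requested when writing from PySpark. It is not from the original thread. It assumes a Spark build whose ORC writer reads the `orc.stripe.size` key from the writer options; builds affected by the issue described above may silently ignore it. The function name `write_orc_with_stripe_size` and the default value are illustrative only.

```python
STRIPE_SIZE_BYTES = 64 * 1024 * 1024  # 67108864, the value tried in the question


def write_orc_with_stripe_size(df, path, stripe_size=STRIPE_SIZE_BYTES):
    """Write a Spark DataFrame as zlib-compressed ORC, requesting a stripe size.

    Assumption: the ORC writer in this Spark build picks up "orc.stripe.size"
    from the writer options. Builds with the older Hive library may ignore it.
    """
    (
        df.write
        .format("orc")
        .option("compression", "zlib")                # ORC+zlib, as in the question
        .option("orc.stripe.size", str(stripe_size))  # requested stripe size in bytes
        .save(path)
    )
```

With a real DataFrame this would be called as `write_orc_with_stripe_size(df, "s3://bucket/path")`, where the path is a placeholder. Whether the stripes actually grow has to be checked on the written files, since the setting is only a request to the writer.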



Source: https://stackoverflow.com/questions/48250778/spark-small-orc-stripes
