Spark DataFrame write to Parquet table - slow at updating partition stats


Question


When I write data from a DataFrame into a partitioned Parquet table, all the tasks succeed, but then the process gets stuck updating partition stats:

16/10/05 03:46:13 WARN log: Updating partition stats fast for: 
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for: 
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for: 



df.write.format("parquet").mode("overwrite").partitionBy("part1").insertInto("db.tbl")

My table has > 400 columns and > 1000 partitions. Please let me know whether we can optimize and speed up the updating of partition stats.


Answer 1:


I feel the problem here is that there are too many partitions for a table with over 400 columns. Every time you overwrite a table in Hive, the statistics are updated. In your case it will try to update statistics for more than 1000 partitions, and each partition holds data with more than 400 columns. At the two-to-three-minute interval between updates visible in your log, walking all 1000+ partitions adds up to many hours in the metastore.

Try reducing the number of partitions (use a different partition column, or if it is a date column, consider partitioning by month) and you should see a significant improvement in performance. A sketch of that idea follows.
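For illustration, here is a minimal PySpark sketch of the month-based partitioning suggested above; the date column event_date, the derived column part_month, and the target table db.tbl_by_month are hypothetical stand-ins for your own names:

from pyspark.sql import functions as F

# Derive a coarser partition key (one value per month instead of one per day),
# so the write produces far fewer partitions for the metastore to update.
df_by_month = df.withColumn(
    "part_month", F.date_format(F.col("event_date"), "yyyy-MM")
)

(df_by_month.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("part_month")
    .saveAsTable("db.tbl_by_month"))

With monthly partitions, roughly 1000 daily partitions (about three years of data) collapse to around 36 partition-stats updates after each write.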



Source: https://stackoverflow.com/questions/39869728/spark-data-frame-write-to-parquet-table-slow-at-updating-partition-stats
