Question
When I write data from a DataFrame into a partitioned Parquet table, the process gets stuck at updating partition stats even after all the tasks have completed successfully.
16/10/05 03:46:13 WARN log: Updating partition stats fast for:
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for:
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for:
df.write.format("parquet").mode("overwrite").partitionBy("part1").insertInto("db.tbl")
My table has > 400 columns and > 1000 partitions. Please let me know if we can optimize and speed up the partition stats update.
Answer 1:
I feel the problem here is that there are too many partitions for a table with > 400 columns. Every time you overwrite a table in Hive, the statistics are updated. In your case it will try to update statistics for 1000+ partitions, and each partition holds data with > 400 columns.
Try reducing the number of partitions (use another partition column, or if it is a date column, consider partitioning by month) and you should see a significant change in performance; see the sketch below.
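For illustration, here is a minimal Scala sketch of that suggestion, assuming Spark 2.x with Hive support. The column event_date, the source table db.source_tbl, and the target table db.tbl_by_month are hypothetical placeholders, not names from the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder()
  .appName("repartition-by-month")
  .enableHiveSupport()
  .getOrCreate()

// df stands in for the DataFrame being written in the question.
val df = spark.table("db.source_tbl")

// Derive a coarser partition key (yyyy-MM instead of a daily date), so the
// metastore only has to update stats for roughly 12 partitions per year
// instead of ~365.
val byMonth = df.withColumn("part_month", date_format(col("event_date"), "yyyy-MM"))

byMonth.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("part_month")
  .saveAsTable("db.tbl_by_month")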
Source: https://stackoverflow.com/questions/39869728/spark-data-frame-write-to-parquet-table-slow-at-updating-partition-stats