Error using Spark: 'save' does not support bucketing right now

烂漫一生 submitted on 2019-12-05 06:01:35

Question


I have a DataFrame that I am trying to partition by a column, sort by that same column, and save in Parquet format using the following command:

df.write().format("parquet")
  .partitionBy("dynamic_col")
  .sortBy("dynamic_col")
  .save("test.parquet");

I get the following error:

reason: User class threw exception: org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;

Is save(...) not allowed here? Is only saveAsTable(...), which saves the data to Hive, allowed?

Any suggestions are helpful.


Answer 1:


The problem is that sortBy is currently (Spark 2.3.1) supported only together with bucketing, bucketing has to be used in combination with saveAsTable, and the bucket sorting column must not be part of the partition columns.

So you have two options:

  1. Do not use sortBy:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .option("path", output_path)
    .save()
    
  2. Use sortBy with bucketing and save it through the metastore using saveAsTable (a filled-in sketch follows below):

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .bucketBy(n, bucket_col)
    .sortBy(bucket_col)
    .option("path", output_path)
    .saveAsTable(table_name)
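
For example, a filled-in PySpark sketch of option 2. The bucket/sort column "sort_col", the bucket count of 4, the path, and the table name are placeholders assumed here; the point is that the sorting/bucketing column must be different from the dynamic_col you partition by:

    # Minimal sketch, assuming df has a second column "sort_col" distinct
    # from the "dynamic_col" partition column.
    df.write \
        .format("parquet") \
        .partitionBy("dynamic_col") \
        .bucketBy(4, "sort_col") \
        .sortBy("sort_col") \
        .option("path", "/tmp/bucketed_output") \
        .saveAsTable("bucketed_table")

Because the bucketing metadata lives in the metastore, read the result back with spark.table("bucketed_table"); reading the files directly via spark.read.parquet(...) will not pick up the bucketing.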
    



Answer 2:


Try

df.repartition("dynamic_col").write.partitionBy("dynamic_col").parquet("test.parquet")
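
Here repartition("dynamic_col") collocates all rows with the same dynamic_col value in a single task, so each dynamic_col=... directory gets one output file, but the sort is dropped entirely. If sorted rows within each file are still wanted without bucketing or the metastore, one possible PySpark variant (not part of the original answer; "sort_col" is an assumed second column) is sortWithinPartitions:

    # Sketch only: sorts each task's rows before writing; no metastore involved.
    df.repartition("dynamic_col") \
        .sortWithinPartitions("sort_col") \
        .write \
        .partitionBy("dynamic_col") \
        .parquet("test.parquet")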


Source: https://stackoverflow.com/questions/52799025/error-using-spark-save-does-not-support-bucketing-right-now
