Why is Spark saveAsTable with bucketBy creating thousands of files?

野的像风 2020-12-24 02:38

Context

Spark 2.0.1, spark-submit in cluster mode. I am reading a parquet file from hdfs:

val spark = SparkSession.builder
      .appName("...")   // the rest of the builder chain is cut off in the original post
      .getOrCreate()
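
The rest of the question body is cut off here; based on the title, the read-and-write step was presumably along these lines (the HDFS path, bucket column, bucket count, and table name below are hypothetical):

// Read the Parquet input from HDFS.
val df = spark.read.parquet("hdfs:///path/to/input")

// Bucketed write through the DataFrame writer. Each write task emits its own
// file per bucket it sees, so a job with many tasks can produce up to
// (number of tasks) x (number of buckets) files, which is the behaviour the title asks about.
df.write
      .bucketBy(50, "user_id")
      .sortBy("user_id")
      .saveAsTable("events_bucketed")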


        
3 Answers
  •  暗喜
     2020-12-24 03:12

    Please use Spark SQL, which writes the data into the Hive table through the Hive support (HiveContext), so the write will use the number of buckets you have configured in the table schema.

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .config("hive.exec.dynamic.partition", "true")
        .config("hive.exec.dynamic.partition.mode", "nonstrict")
        .config("hive.execution.engine", "tez")
        .config("hive.exec.max.dynamic.partitions", "400")
        .config("hive.exec.max.dynamic.partitions.pernode", "400")
        .config("hive.enforce.bucketing", "true")
        .config("hive.optimize.sort.dynamic.partition", "true")
        .config("hive.vectorized.execution.enabled", "true")
        .config("hive.enforce.sorting", "true")
        .enableHiveSupport()
        .getOrCreate()
    
      // Assumes the Parquet data has been registered as a temp view named "myParquetFile".
      spark.sql("insert into hiveTableName partition (partition_column) select * from myParquetFile")
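
    The number of buckets comes from the table definition, so the target table has to exist as a bucketed Hive table before the insert. A minimal sketch of such a DDL is below; the columns, bucket column, and bucket count are hypothetical, and on some Spark versions you may have to run this from Hive/beeline instead, because they cannot create bucketed Hive serde tables.

      // Hypothetical bucketed, partitioned table; CLUSTERED BY ... INTO N BUCKETS
      // is what fixes the bucket count that the insert above is meant to honour.
      spark.sql("""
        CREATE TABLE IF NOT EXISTS hiveTableName (
          user_id BIGINT,
          event   STRING
        )
        PARTITIONED BY (partition_column STRING)
        CLUSTERED BY (user_id) SORTED BY (user_id) INTO 50 BUCKETS
        STORED AS ORC
      """)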
    

    Spark's own bucketing implementation does not honor the configured number of buckets as a file count: each partition writes its own separate files, so you end up with a lot of files for each bucket.
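
    If you stay with the DataFrame writer instead, a sketch of one way to keep the file count down (assuming df is the DataFrame read from the Parquet file; the bucket column, bucket count, and table name are hypothetical) is to repartition on the bucket column first, so each write task holds exactly one bucket:

      import org.apache.spark.sql.functions.col

      // With, say, 200 tasks and 50 buckets, a plain bucketBy write can emit up to
      // 200 x 50 files. Repartitioning into 50 partitions on the bucket column means
      // each task carries a single bucket, so each bucket ends up in roughly one file.
      df.repartition(50, col("user_id"))
        .write
        .bucketBy(50, "user_id")
        .sortBy("user_id")
        .saveAsTable("events_bucketed_compact")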

    Please refer to this link: https://www.slideshare.net/databricks/hive-bucketing-in-apache-spark-with-tejas-patil

    Hope this helps.

    Ravi
