Why is Spark saveAsTable with bucketBy creating thousands of files?

Backend · unresolved · 3 answers · 583 views
Asked by 野的像风 on 2020-12-24 02:38

Context

Spark 2.0.1, spark-submit in cluster mode. I am reading a Parquet file from HDFS:

val spark = SparkSession.builder
  .appName("bucketing")   // the builder chain was truncated in the source; appName/getOrCreate are assumed
  .getOrCreate()
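
A minimal sketch of the kind of write this question is about, continuing the truncated snippet; the path, bucket count, column, and table name are placeholders, not the asker's actual values:

val df = spark.read.parquet("hdfs:///path/to/input")   // placeholder path

// Each writing task emits one file per bucket it sees, so a bucketed
// write like this can fan out into numTasks × numBuckets files.
df.write
  .bucketBy(50, "user_id")          // placeholder bucket count and column
  .sortBy("user_id")
  .saveAsTable("events_bucketed")   // placeholder table name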


        
3 Answers
  •  遥遥无期
    2020-12-24 03:28

    The same questions popped into my mind when I saw too many files, so I searched and found this:

    "Unlike bucketing in Apache Hive, Spark SQL creates the bucket files per the number of buckets and partitions. In other words, the number of bucketing files is the number of buckets multiplied by the number of task writers (one per partition)."

    Source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html

    I think this answers your question about why there are so many files: by that rule, a write with, say, 200 tasks and 50 buckets can produce up to 200 × 50 = 10,000 files.

    Your question no. 2 can be answered like this: if we control the number of partitions with repartition (provided the resources are available), we can limit the number of files created, as in the sketch below.
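
    A minimal sketch of that idea, reusing the placeholder bucket count, column, and table name from above: repartitioning on the bucketing column into one partition per bucket means each writing task holds roughly one bucket's rows, which keeps the file count close to the bucket count.

    import org.apache.spark.sql.functions.col

    // 50 and "user_id" are placeholders; align the repartition count and
    // column with the bucketBy() call so each task writes (at most) one
    // bucket, instead of fanning out into numTasks × numBuckets files.
    df.repartition(50, col("user_id"))
      .write
      .bucketBy(50, "user_id")
      .sortBy("user_id")
      .mode("overwrite")
      .saveAsTable("events_bucketed")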
