Why is Spark saveAsTable with bucketBy creating thousands of files?

Asked by 野的像风, 2020-12-24 02:38

Context

Spark 2.0.1, running via spark-submit in cluster mode. I am reading a parquet file from HDFS:

val spark = SparkSession.builder
  .appName("bucketing-job")  // hypothetical app name; minimal completion of the truncated builder
  .getOrCreate()
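
For reference, a minimal sketch of the kind of read-then-bucketed-write the title describes (the paths, column name, and bucket count here are hypothetical):

// Hypothetical reproduction: read parquet from HDFS, then write a bucketed
// table. Without first repartitioning by the bucket expression, every task
// writes its own file for each bucket it holds, hence thousands of files.
val df = spark.read.parquet("hdfs:///path/to/input")
df.write
  .bucketBy(50, "someColumn")
  .sortBy("someColumn")
  .option("path", "hdfs:///path/to/table")
  .saveAsTable("bucketed_table")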


        
3 Answers
  •  离开以前
    2020-12-24 03:07

    I was able to find a workaround (on Spark 2.1). It solves the file-count problem, though it might have some performance implications. The underlying issue is that each write task emits a separate file for every bucket it holds rows for, so a job with many tasks can produce up to tasks × numBuckets files; repartitioning by the bucket expression first sends all rows for a given bucket to a single task, yielding one file per bucket.

    import org.apache.spark.sql.functions.{hash, lit, pmod}

    dataframe
      // compute the bucket id the same way Spark does: positive mod of the Murmur3 hash
      .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
      // one shuffle partition per bucket, so each bucket is written by a single task
      .repartition(numBuckets, $"bucket")
      .write
      .format(fmt)
      .bucketBy(numBuckets, "bucketColumn")
      .sortBy("bucketColumn")
      .option("path", "/path/to/your/table")
      .saveAsTable("table_name")
    

    I think Spark's bucketing algorithm takes the positive mod of the Murmur3 hash of the bucket column value, which is exactly what pmod(hash(...), lit(numBuckets)) computes. The snippet simply replicates that logic and repartitions the data so that each shuffle partition contains all the rows for one bucket.
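
    To sanity-check that claim, here is a small sketch you can run in a Spark shell (the column name and bucket count are arbitrary; this only evaluates the expression, it is not Spark's internal code path):

    import org.apache.spark.sql.functions.{hash, lit, pmod}
    import spark.implicits._

    // assign each row the bucket id produced by pmod(hash(col), numBuckets)
    spark.range(10).toDF("bucketColumn")
      .withColumn("bucket", pmod(hash($"bucketColumn"), lit(5)))
      .show()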

    You can do the same when combining partitioning with bucketing:

    dataframe
      .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
      // hash on both the partition column and the bucket id, so the rows for a
      // given (partition, bucket) pair end up together in one task
      .repartition(numBuckets, $"partitionColumn", $"bucket")
      .write
      .format(fmt)
      .partitionBy("partitionColumn")
      .bucketBy(numBuckets, "bucketColumn")
      .sortBy("bucketColumn")
      .option("path", "/path/to/your/table")
      .saveAsTable("table_name")
    

    Tested locally with 3 partitions and 5 buckets using the csv format (both the partition and the bucket columns are just numbers):

    $ tree .
    .
    ├── _SUCCESS
    ├── partitionColumn=0
    │   ├── bucket=0
    │   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
    │   ├── bucket=1
    │   │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
    │   ├── bucket=2
    │   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
    │   ├── bucket=3
    │   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
    │   └── bucket=4
    │       └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
    ├── partitionColumn=1
    │   ├── bucket=0
    │   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
    │   ├── bucket=1
    │   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
    │   ├── bucket=2
    │   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
    │   ├── bucket=3
    │   │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
    │   └── bucket=4
    │       └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
    └── partitionColumn=2
        ├── bucket=0
        │   └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
        ├── bucket=1
        │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
        ├── bucket=2
        │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
        ├── bucket=3
        │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
        └── bucket=4
            └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
    

    Here are the bucket=0 files for all 3 partitions side by side (you can see they contain exactly the same values, i.e. a given key hashes to the same bucket regardless of which partition it falls in):

    $ paste partitionColumn=0/bucket=0/part-00004-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=1/bucket=0/part-00002-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=2/bucket=0/part-00000-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv | head
    0   0   0
    4   4   4
    6   6   6
    16  16  16
    18  18  18
    20  20  20
    26  26  26
    27  27  27
    29  29  29
    32  32  32
    

    I actually liked having the extra bucket column around. But if you don't, you can drop it right before the write and you'll still get exactly numBuckets files per partition, as in the sketch below.
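
    A sketch of that variant for the non-partitioned case (drop is just a projection, so the shuffle layout established by repartition is preserved at write time):

    dataframe
      .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
      .repartition(numBuckets, $"bucket")
      .drop("bucket")  // projection only; does not undo the repartitioning
      .write
      .format(fmt)
      .bucketBy(numBuckets, "bucketColumn")
      .sortBy("bucketColumn")
      .option("path", "/path/to/your/table")
      .saveAsTable("table_name")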
