Context
Spark 2.0.1, spark-submit in cluster mode. I am reading a Parquet file from HDFS:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("...")
  .getOrCreate()

val df = spark.read.parquet("hdfs://...")
I was able to find a workaround (on Spark 2.1). It solves the number-of-files problem, but since it adds an extra shuffle (the repartition), it might have some performance implications.
import org.apache.spark.sql.functions.{hash, lit, pmod}
import spark.implicits._

dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"bucket")
.write
.format(fmt)
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")
I think Spark's bucketing algorithm does a positive mod of the MurmurHash3 hash of the bucket column value. This simply replicates that logic and repartitions the data so that each output partition contains all the data for one bucket.
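To see what that expression assigns, a throwaway check like this works (illustrative column name and bucket count; hash() is Spark's Murmur3 hash with seed 42):

// Illustrative only: assign ids 0-9 to 5 buckets with the same expression used above
spark.range(10).toDF("bucketColumn")
.select($"bucketColumn", pmod(hash($"bucketColumn"), lit(5)).as("bucket"))
.show()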
You can do the same with partitioning + bucketing.
dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"partitionColumn", $"bucket")
.write
.format(fmt)
.partitionBy("partitionColumn")
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")
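If you want to double-check what was recorded for the table, a quick look at the catalog entry (the output layout varies a bit between Spark versions) should show the bucketing spec:

// Sanity check: the detailed table information should list the bucket columns and count
spark.sql("DESCRIBE EXTENDED table_name").show(100, truncate = false)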
Tested locally with 3 partitions and 5 buckets using the CSV format (both the partition and bucket columns are just numbers):
$ tree .
.
├── _SUCCESS
├── partitionColumn=0
│   ├── bucket=0
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
├── partitionColumn=1
│   ├── bucket=0
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
└── partitionColumn=2
    ├── bucket=0
    │   └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
    ├── bucket=1
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
    ├── bucket=2
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
    ├── bucket=3
    │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
    └── bucket=4
        └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
Here's bucket=0 for all 3 partitions (you can see that they all contain the same values):
$ paste partitionColumn=0/bucket=0/part-00004-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=1/bucket=0/part-00002-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=2/bucket=0/part-00000-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv | head
0 0 0
4 4 4
6 6 6
16 16 16
18 18 18
20 20 20
26 26 26
27 27 27
29 29 29
32 32 32
I actually liked the extra bucket index (the bucket=N directory level). But if you don't, you can drop the bucket column right before the write and you'll get numBuckets files per partition.
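For example, for the partitioned variant above (dropping the column after the repartition is just a projection, so the data stays co-located by bucket):

dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"partitionColumn", $"bucket")
.drop("bucket") // rows are already co-located by the repartition; just don't write the helper column
.write
.format(fmt)
.partitionBy("partitionColumn")
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")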