Context
Spark 2.0.1, spark-submit in cluster mode. I am reading a Parquet file from HDFS:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("...")
  .getOrCreate()

val df = spark.read.parquet("hdfs://...")
I was able to find a workaround (on Spark 2.1). It solves the number-of-files problem, but since it adds an extra shuffle (the repartition), it might have some performance implications.
import org.apache.spark.sql.functions.{hash, lit, pmod}
import spark.implicits._

dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"bucket")
.write
.format(fmt)
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")
I think Spark's bucketing algorithm does a positive mod of the MurmurHash3 hash of the bucket column value. This simply replicates that logic and repartitions the data so that each output partition contains all the data for one bucket.
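To see what that expression assigns, a throwaway check like this works (illustrative column name and bucket count; hash() is Spark's Murmur3 hash with seed 42):

// Illustrative only: assign ids 0-9 to 5 buckets with the same expression used above
spark.range(10).toDF("bucketColumn")
.select($"bucketColumn", pmod(hash($"bucketColumn"), lit(5)).as("bucket"))
.show()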
You can do the same with partitioning + bucketing.
dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"partitionColumn", $"bucket")
.write
.format(fmt)
.partitionBy("partitionColumn")
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")
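If you want to double-check what was recorded for the table, a quick look at the catalog entry (the output layout varies a bit between Spark versions) should show the bucketing spec:

// Sanity check: the detailed table information should list the bucket columns and count
spark.sql("DESCRIBE EXTENDED table_name").show(100, truncate = false)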
Tested locally with 3 partitions and 5 buckets using the CSV format (both the partition and bucket columns are just numbers):
$ tree .
.
├── _SUCCESS
├── partitionColumn=0
│   ├── bucket=0
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
├── partitionColumn=1
│   ├── bucket=0
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
└── partitionColumn=2
    ├── bucket=0
    │   └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
    ├── bucket=1
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
    ├── bucket=2
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
    ├── bucket=3
    │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
    └── bucket=4
        └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
Here's bucket=0 for all 3 partitions (you can see that they all contain the same values):
$ paste partitionColumn=0/bucket=0/part-00004-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=1/bucket=0/part-00002-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=2/bucket=0/part-00000-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv | head
0 0 0
4 4 4
6 6 6
16 16 16
18 18 18
20 20 20
26 26 26
27 27 27
29 29 29
32 32 32
I actually liked the extra bucket index (the bucket=N directory level). But if you don't, you can drop the bucket column right before the write and you'll get numBuckets files per partition.
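For example, for the partitioned variant above (dropping the column after the repartition is just a projection, so the data stays co-located by bucket):

dataframe
.withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
.repartition(numBuckets, $"partitionColumn", $"bucket")
.drop("bucket") // rows are already co-located by the repartition; just don't write the helper column
.write
.format(fmt)
.partitionBy("partitionColumn")
.bucketBy(numBuckets, "bucketColumn")
.sortBy("bucketColumn")
.option("path", "/path/to/your/table")
.saveAsTable("table_name")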