dataframe look up and optimization

被撕碎了的回忆 2020-12-02 03:08

I am using spark-sql 2.4.3 with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate",  "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate",  "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
2 Answers
  • 2020-12-02 03:35

    If the lookup data is small, you can collect it into a Map and broadcast it. The broadcast map can then be used inside a UDF, as shown below.

    Load the test data provided

        // toDF on a local collection needs the session's implicits in scope
        import spark.implicits._

        val data = List(
          ("20", "score", "school", 14, 12),
          ("21", "score", "school", 13, 13),
          ("22", "rate",  "school", 11, 14),
          ("23", "score", "school", 11, 14),
          ("24", "rate",  "school", 12, 12),
          ("25", "score", "school", 11, 14)
        )
        val df = data.toDF("id", "code", "entity", "value1", "value2")
        df.show
        /**
          * +---+-----+------+------+------+
          * | id| code|entity|value1|value2|
          * +---+-----+------+------+------+
          * | 20|score|school|    14|    12|
          * | 21|score|school|    13|    13|
          * | 22| rate|school|    11|    14|
          * | 23|score|school|    11|    14|
          * | 24| rate|school|    12|    12|
          * | 25|score|school|    11|    14|
          * +---+-----+------+------+------+
          */
    
        // this lookup data is populated from the DB
    
        val ll = List(
          ("aaaa", 11),
          ("aaa", 12),
          ("aa", 13),
          ("a", 14)
        )
        val codeValudeDf = ll.toDF( "code", "value")
        codeValudeDf.show
        /**
          * +----+-----+
          * |code|value|
          * +----+-----+
          * |aaaa|   11|
          * | aaa|   12|
          * |  aa|   13|
          * |   a|   14|
          * +----+-----+
          */
    

    The broadcast map is then used in a UDF as below:

    
        // Row pattern matching needs org.apache.spark.sql.Row in scope
        import org.apache.spark.sql.Row

        // collect the small lookup table to the driver as a value -> code map,
        // then broadcast it to all executors
        val lookUp = spark.sparkContext
          .broadcast(codeValudeDf.map { case Row(code: String, value: Integer) => value -> code }
            .collect().toMap)

        // the UDF does an in-memory map lookup on each executor;
        // a missing key (None) becomes null in the result column
        val look_up = udf((value: Integer) => lookUp.value.get(value))
    
        df.withColumn("value1",
          when($"code" === "score", look_up($"value1")).otherwise($"value1".cast("string")))
          .withColumn("value2",
            when($"code" === "score", look_up($"value2")).otherwise($"value2".cast("string")))
          .show(false)
        /**
          * +---+-----+------+------+------+
          * |id |code |entity|value1|value2|
          * +---+-----+------+------+------+
          * |20 |score|school|a     |aaa   |
          * |21 |score|school|aa    |aa    |
          * |22 |rate |school|11    |14    |
          * |23 |score|school|aaaa  |a     |
          * |24 |rate |school|12    |12    |
          * |25 |score|school|aaaa  |a     |
          * +---+-----+------+------+------+
          */
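    When the lookup table is this small, the same result can also be obtained with an explicit broadcast join instead of a UDF, which keeps the mapping visible to the Catalyst optimizer. A minimal local-mode sketch (the object name, the `v1`/`c1`/`v2`/`c2` aliases, and the reduced test data are mine, not from the answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, when}

object BroadcastJoinLookup {
  // returns (id, value1, value2) after the lookup, for easy checking
  def run(): Seq[(String, String, String)] = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("broadcast-join-lookup").getOrCreate()
    import spark.implicits._

    val df = List(
      ("20", "score", "school", 14, 12),
      ("22", "rate",  "school", 11, 14)
    ).toDF("id", "code", "entity", "value1", "value2")

    val lookupDf = List(("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14))
      .toDF("code_str", "value")

    // one aliased copy of the lookup per value column, each broadcast
    val l1 = lookupDf.select($"value".as("v1"), $"code_str".as("c1"))
    val l2 = lookupDf.select($"value".as("v2"), $"code_str".as("c2"))

    val out = df
      .join(broadcast(l1), $"value1" === $"v1", "left")
      .join(broadcast(l2), $"value2" === $"v2", "left")
      // replace the numeric values only for "score" rows, as in the answer
      .withColumn("value1", when($"code" === "score", $"c1").otherwise($"value1".cast("string")))
      .withColumn("value2", when($"code" === "score", $"c2").otherwise($"value2".cast("string")))
      .select($"id", $"value1", $"value2")

    val result = out.as[(String, String, String)].collect().toSeq
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

    The left join plus `when` keeps non-"score" rows unchanged, matching the UDF's behavior.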
    
    
    
  • 2020-12-02 03:48

    Using a broadcast map indeed looks like a wise decision, as you do not need to hit your database to pull the lookup data every time.

    Here I have solved the problem using a key-value map inside a UDF. I am unable to compare its performance with the broadcast-map approach, but would welcome input from Spark experts.

    Step# 1: Building the KeyValueMap

    val data = List(
      ("20", "score", "school", 14, 12),
      ("21", "score", "school", 13, 13),
      ("22", "rate",  "school", 11, 14),
      ("23", "score", "school", 11, 14),
      ("24", "rate",  "school", 12, 12),
      ("25", "score", "school", 11, 14)
    )
    val df = data.toDF("id", "code", "entity", "value1", "value2")

    val ll = List(
      ("aaaa", 11),
      ("aaa", 12),
      ("aa", 13),
      ("a", 14)
    )
    val codeValudeDf = ll.toDF("code", "value")
    
    
    // NOTE: zipping two separately collected columns assumes both collect()s
    // return rows in the same order; collecting (value, code) pairs in a
    // single pass would be safer
    val Keys = codeValudeDf.select("value").collect().map(_(0).toString).toList
    val Values = codeValudeDf.select("code").collect().map(_(0).toString).toList
    val KeyValueMap = Keys.zip(Values).toMap
    

    Step# 2: Creating the UDF

    // null-safe lookup: non-"score" rows keep the raw value unchanged
    def CodeToValue(code: String, key: String): String = {
      if (key == null) ""
      else if (code != "score") key
      else KeyValueMap.getOrElse(key, "not found!")
    }

    val CodeToValueUDF = udf(CodeToValue(_: String, _: String): String)
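
    Because the mapping logic is a plain Scala function, it can be sanity-checked without a SparkSession. A minimal sketch with the lookup data from Step# 1 hard-coded (the object name `KeyValueMapSketch` is mine):

```scala
// Plain-Scala sketch of the UDF's core logic, no Spark needed.
object KeyValueMapSketch {
  // the same map Step# 1 builds: value (as string) -> code
  val keyValueMap: Map[String, String] =
    Map("11" -> "aaaa", "12" -> "aaa", "13" -> "aa", "14" -> "a")

  def codeToValue(code: String, key: String): String =
    if (key == null) ""                           // null-safe, as in the answer
    else if (code != "score") key                 // non-"score" rows keep the raw value
    else keyValueMap.getOrElse(key, "not found!") // lookup for "score" rows
}
```

    For example, `codeToValue("score", "14")` yields `"a"`, while `codeToValue("rate", "11")` passes `"11"` through unchanged.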
    

    Step# 3: Adding derived columns to the original dataframe using the UDF

    val newdf  = df.withColumn("Col1", CodeToValueUDF(col("code"), col("value1")))
    
    val finaldf = newdf.withColumn("Col2", CodeToValueUDF(col("code"), col("value2")))
        
    finaldf.show(false)
    
    +---+-----+------+------+------+----+----+
    | id| code|entity|value1|value2|Col1|Col2|
    +---+-----+------+------+------+----+----+
    | 20|score|school|    14|    12|   a| aaa|
    | 21|score|school|    13|    13|  aa|  aa|
    | 22| rate|school|    11|    14|  11|  14|
    | 23|score|school|    11|    14|aaaa|   a|
    | 24| rate|school|    12|    12|  12|  12|
    | 25|score|school|    11|    14|aaaa|   a|
    +---+-----+------+------+------+----+----+
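
    A further option (my suggestion, not from either answer) is to avoid a UDF entirely by embedding the lookup as a literal map column: Spark 2.4's `typedLit` and `element_at` keep the whole expression native, so Catalyst can optimize it. A reduced local-mode sketch (object name and two-row test data are mine):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{element_at, typedLit, when}

object MapLiteralLookup {
  // returns (id, Col1) pairs for easy checking
  def run(): Seq[(String, String)] = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("map-literal-lookup").getOrCreate()
    import spark.implicits._

    val df = List(
      ("20", "score", 14),
      ("22", "rate", 11)
    ).toDF("id", "code", "value1")

    // the whole lookup table as a literal MapType column
    val lookupCol = typedLit(Map("11" -> "aaaa", "12" -> "aaa", "13" -> "aa", "14" -> "a"))

    // element_at returns null for missing keys; non-"score" rows keep the raw value
    val out = df.withColumn("Col1",
      when($"code" === "score", element_at(lookupCol, $"value1".cast("string")))
        .otherwise($"value1".cast("string")))

    val result = out.select($"id", $"Col1").as[(String, String)].collect().toSeq
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```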
    