dataframe look up and optimization

被撕碎了的回忆 2020-12-02 03:08

I am using spark-sql 2.4.3 with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate",  "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate",  "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
2 Answers
  • 2020-12-02 03:35

    If the lookup data is small, you can collect it into a Map and broadcast it. The broadcast map can then be used inside a UDF, as shown below.

    Load the test data provided

        // toDF on a local collection needs the session's implicits in scope
        import spark.implicits._

        val data = List(
          ("20", "score", "school", 14, 12),
          ("21", "score", "school", 13, 13),
          ("22", "rate",  "school", 11, 14),
          ("23", "score", "school", 11, 14),
          ("24", "rate",  "school", 12, 12),
          ("25", "score", "school", 11, 14)
        )
        val df = data.toDF("id", "code", "entity", "value1", "value2")
        df.show
        /**
          * +---+-----+------+------+------+
          * | id| code|entity|value1|value2|
          * +---+-----+------+------+------+
          * | 20|score|school|    14|    12|
          * | 21|score|school|    13|    13|
          * | 22| rate|school|    11|    14|
          * | 23|score|school|    11|    14|
          * | 24| rate|school|    12|    12|
          * | 25|score|school|    11|    14|
          * +---+-----+------+------+------+
          */
    
        // this lookup data is populated from the DB
    
        val ll = List(
          ("aaaa", 11),
          ("aaa", 12),
          ("aa", 13),
          ("a", 14)
        )
        val codeValudeDf = ll.toDF( "code", "value")
        codeValudeDf.show
        /**
          * +----+-----+
          * |code|value|
          * +----+-----+
          * |aaaa|   11|
          * | aaa|   12|
          * |  aa|   13|
          * |   a|   14|
          * +----+-----+
          */
    

    The broadcast map is then used in a UDF as below:

    
        // Row pattern matching needs org.apache.spark.sql.Row in scope
        import org.apache.spark.sql.Row

        // collect the small lookup table to the driver as a value -> code map,
        // then broadcast it to all executors
        val lookUp = spark.sparkContext
          .broadcast(codeValudeDf.map { case Row(code: String, value: Integer) => value -> code }
            .collect().toMap)

        // the UDF does an in-memory map lookup on each executor;
        // a missing key (None) becomes null in the result column
        val look_up = udf((value: Integer) => lookUp.value.get(value))
    
        df.withColumn("value1",
          when($"code" === "score", look_up($"value1")).otherwise($"value1".cast("string")))
          .withColumn("value2",
            when($"code" === "score", look_up($"value2")).otherwise($"value2".cast("string")))
          .show(false)
        /**
          * +---+-----+------+------+------+
          * |id |code |entity|value1|value2|
          * +---+-----+------+------+------+
          * |20 |score|school|a     |aaa   |
          * |21 |score|school|aa    |aa    |
          * |22 |rate |school|11    |14    |
          * |23 |score|school|aaaa  |a     |
          * |24 |rate |school|12    |12    |
          * |25 |score|school|aaaa  |a     |
          * +---+-----+------+------+------+
          */
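    When the lookup table is this small, the same result can also be obtained with an explicit broadcast join instead of a UDF, which keeps the mapping visible to the Catalyst optimizer. A minimal local-mode sketch (the object name, the `v1`/`c1`/`v2`/`c2` aliases, and the reduced test data are mine, not from the answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, when}

object BroadcastJoinLookup {
  // returns (id, value1, value2) after the lookup, for easy checking
  def run(): Seq[(String, String, String)] = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("broadcast-join-lookup").getOrCreate()
    import spark.implicits._

    val df = List(
      ("20", "score", "school", 14, 12),
      ("22", "rate",  "school", 11, 14)
    ).toDF("id", "code", "entity", "value1", "value2")

    val lookupDf = List(("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14))
      .toDF("code_str", "value")

    // one aliased copy of the lookup per value column, each broadcast
    val l1 = lookupDf.select($"value".as("v1"), $"code_str".as("c1"))
    val l2 = lookupDf.select($"value".as("v2"), $"code_str".as("c2"))

    val out = df
      .join(broadcast(l1), $"value1" === $"v1", "left")
      .join(broadcast(l2), $"value2" === $"v2", "left")
      // replace the numeric values only for "score" rows, as in the answer
      .withColumn("value1", when($"code" === "score", $"c1").otherwise($"value1".cast("string")))
      .withColumn("value2", when($"code" === "score", $"c2").otherwise($"value2".cast("string")))
      .select($"id", $"value1", $"value2")

    val result = out.as[(String, String, String)].collect().toSeq
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

    The left join plus `when` keeps non-"score" rows unchanged, matching the UDF's behavior.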
    
    
    
  • 2020-12-02 03:48

    Using a broadcast map indeed looks like a wise decision, as you do not need to hit your database to pull the lookup data every time.

    Here I have solved the problem using a key-value map inside a UDF. I am unable to compare its performance with the broadcast-map approach, but would welcome input from Spark experts.

    Step# 1: Building the KeyValueMap

    val data = List(
      ("20", "score", "school", 14, 12),
      ("21", "score", "school", 13, 13),
      ("22", "rate",  "school", 11, 14),
      ("23", "score", "school", 11, 14),
      ("24", "rate",  "school", 12, 12),
      ("25", "score", "school", 11, 14)
    )
    val df = data.toDF("id", "code", "entity", "value1", "value2")

    val ll = List(
      ("aaaa", 11),
      ("aaa", 12),
      ("aa", 13),
      ("a", 14)
    )
    val codeValudeDf = ll.toDF("code", "value")
    
    
    // NOTE: zipping two separately collected columns assumes both collect()s
    // return rows in the same order; collecting (value, code) pairs in a
    // single pass would be safer
    val Keys = codeValudeDf.select("value").collect().map(_(0).toString).toList
    val Values = codeValudeDf.select("code").collect().map(_(0).toString).toList
    val KeyValueMap = Keys.zip(Values).toMap
    

    Step# 2: Creating the UDF

    // null-safe lookup: non-"score" rows keep the raw value unchanged
    def CodeToValue(code: String, key: String): String = {
      if (key == null) ""
      else if (code != "score") key
      else KeyValueMap.getOrElse(key, "not found!")
    }

    val CodeToValueUDF = udf(CodeToValue(_: String, _: String): String)
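
    Because the mapping logic is a plain Scala function, it can be sanity-checked without a SparkSession. A minimal sketch with the lookup data from Step# 1 hard-coded (the object name `KeyValueMapSketch` is mine):

```scala
// Plain-Scala sketch of the UDF's core logic, no Spark needed.
object KeyValueMapSketch {
  // the same map Step# 1 builds: value (as string) -> code
  val keyValueMap: Map[String, String] =
    Map("11" -> "aaaa", "12" -> "aaa", "13" -> "aa", "14" -> "a")

  def codeToValue(code: String, key: String): String =
    if (key == null) ""                           // null-safe, as in the answer
    else if (code != "score") key                 // non-"score" rows keep the raw value
    else keyValueMap.getOrElse(key, "not found!") // lookup for "score" rows
}
```

    For example, `codeToValue("score", "14")` yields `"a"`, while `codeToValue("rate", "11")` passes `"11"` through unchanged.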
    

    Step# 3: Adding derived columns to the original dataframe using the UDF

    val newdf  = df.withColumn("Col1", CodeToValueUDF(col("code"), col("value1")))
    
    val finaldf = newdf.withColumn("Col2", CodeToValueUDF(col("code"), col("value2")))
        
    finaldf.show(false)
    
    +---+-----+------+------+------+----+----+
    | id| code|entity|value1|value2|Col1|Col2|
    +---+-----+------+------+------+----+----+
    | 20|score|school|    14|    12|   a| aaa|
    | 21|score|school|    13|    13|  aa|  aa|
    | 22| rate|school|    11|    14|  11|  14|
    | 23|score|school|    11|    14|aaaa|   a|
    | 24| rate|school|    12|    12|  12|  12|
    | 25|score|school|    11|    14|aaaa|   a|
    +---+-----+------+------+------+----+----+
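
    A further option (my suggestion, not from either answer) is to avoid a UDF entirely by embedding the lookup as a literal map column: Spark 2.4's `typedLit` and `element_at` keep the whole expression native, so Catalyst can optimize it. A reduced local-mode sketch (object name and two-row test data are mine):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{element_at, typedLit, when}

object MapLiteralLookup {
  // returns (id, Col1) pairs for easy checking
  def run(): Seq[(String, String)] = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("map-literal-lookup").getOrCreate()
    import spark.implicits._

    val df = List(
      ("20", "score", 14),
      ("22", "rate", 11)
    ).toDF("id", "code", "value1")

    // the whole lookup table as a literal MapType column
    val lookupCol = typedLit(Map("11" -> "aaaa", "12" -> "aaa", "13" -> "aa", "14" -> "a"))

    // element_at returns null for missing keys; non-"score" rows keep the raw value
    val out = df.withColumn("Col1",
      when($"code" === "score", element_at(lookupCol, $"value1".cast("string")))
        .otherwise($"value1".cast("string")))

    val result = out.select($"id", $"Col1").as[(String, String)].collect().toSeq
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```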
    