Extract column values of Dataframe as List in Apache Spark

Backend · open · 10 answers · 1050 views
Asked by 慢半拍i on 2020-12-22 16:52

I want to convert a string column of a data frame to a list. What I can find from the DataFrame API is RDD, so I tried converting it back to RDD first, and then …

10 Answers
  •  春和景丽
    2020-12-22 17:39

    With Spark 2.x and Scala 2.11

    I'd think of 3 possible ways to convert values of a specific column to List.

    Common code snippets for all the approaches

    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder.getOrCreate    
    import spark.implicits._ // for .toDF() method
    
    val df = Seq(
        ("first", 2.0),
        ("test", 1.5), 
        ("choose", 8.0)
      ).toDF("id", "val")
    

    Approach 1

    df.select("id").collect().map(_(0)).toList
    // res9: List[Any] = List(first, test, choose)
    

    What happens here? We collect all the data to the Driver with collect() and then pick element zero from each Row.

    This is not an ideal way of doing it; let's improve it with the next approach.


    Approach 2

    df.select("id").rdd.map(r => r(0)).collect.toList
    // res10: List[Any] = List(first, test, choose)
    

    How is this better? The map transformation is now distributed among the workers rather than running on a single Driver.

    I know rdd.map(r => r(0)) may not seem elegant to you, so let's address that in the next approach.


    Approach 3

    df.select("id").map(r => r.getString(0)).collect.toList
    // res11: List[String] = List(first, test, choose)
    

    Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues on a DataFrame. So we end up using r => r.getString(0); this may be addressed in future versions of Spark.
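    For completeness, a closely related typed variant (a sketch, assuming the same `spark` session and `df` from the common snippet above; this is not part of the original answer): selecting the single column as a `Dataset[String]` avoids the `Row` accessor entirely.

    ```scala
    // Sketch: treat the single selected column as a typed Dataset[String],
    // so no Row accessor (r.getString(0)) is needed.
    // Assumes the `spark` session and `df` defined in the common snippet above.
    import spark.implicits._ // provides the Encoder[String]

    val ids: List[String] = df.select("id").as[String].collect.toList
    // ids: List[String] = List(first, test, choose)
    ```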

    Conclusion

    All three options give the same output, but 2 and 3 are more efficient; in the end, the 3rd one is both efficient and elegant (I'd think).
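    The same pattern extends to non-string columns; a minimal sketch applying approach 3 to the numeric `val` column (assuming the `df` from the common snippet above), with `getDouble` in place of `getString`:

    ```scala
    // Sketch: approach 3 applied to the Double-typed "val" column from the
    // common snippet; getDouble replaces getString for a numeric column.
    // Assumes the `spark` session and `df` defined above.
    import spark.implicits._ // provides the Encoder[Double]

    val vals: List[Double] = df.select("val").map(r => r.getDouble(0)).collect.toList
    // vals: List[Double] = List(2.0, 1.5, 8.0)
    ```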

    Databricks notebook
