Scala - Spark: How to union all DataFrames in a loop

Asked by 抹茶落季 on 2020-12-14 22:41 · 6 answers · 1678 views

Is there a way to build a DataFrame by unioning DataFrames in a loop?

This is some sample code:

var fruits = List(
  "apple",
  "orange",
  "melon"
)
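The sample stops short of the loop itself; presumably the intent is something like the following sketch (assuming a SparkSession spark with spark.implicits._ imported; the column names aCol, bCol, and name are illustrative):

```scala
import spark.implicits._

val fruits = List("apple", "orange", "melon")

// seed with the first fruit, then union the rest one by one
var result = Seq(("aaa", "bbb", fruits.head)).toDF("aCol", "bCol", "name")
for (f <- fruits.tail) {
  result = result.union(Seq(("aaa", "bbb", f)).toDF("aCol", "bCol", "name"))
}
```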


        
6 Answers
  • 2020-12-14 22:50

    Steffen Schmitz's answer is the most concise one, I believe. Below is a more detailed answer if you are looking for more customization (of field types, etc.):

    import org.apache.spark.sql.types.{StructType, StructField, StringType}
    import org.apache.spark.sql.Row
    import spark.implicits._  // needed for the toDF call below
    
    //initialize an empty DF with an explicit schema
    val schema = StructType(
      StructField("aCol", StringType, true) ::
      StructField("bCol", StringType, true) ::
      StructField("name", StringType, true) :: Nil)
    var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
    
    //list to iterate through
    val fruits = List(
      "apple",
      "orange",
      "melon"
    )
    
    for (x <- fruits) {
      //union returns a new Dataset, so reassign the result
      initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
    }
    
    //initialDF.show()
    

    References:

    • How to create an empty DataFrame with a specified schema?
    • https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
    • https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
  • 2020-12-14 22:54

    Using a for comprehension with yield:

    val fruits = List("apple", "orange", "melon")
    
    // toDF requires spark.implicits._ to be in scope
    ( for (f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
    
  • 2020-12-14 22:57

    If you already have multiple DataFrames, you can use the code below, which is concise and efficient (note that reduce throws an exception on an empty sequence):

    val newDFs = Seq(DF1, DF2, DF3)
    newDFs.reduce(_ union _)
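    Note that union matches columns by position; if the DataFrames have the same columns in different orders, unionByName (available since Spark 2.3) matches them by name instead. A minimal sketch, assuming spark.implicits._ is in scope and illustrative DataFrames df1 and df2:

    ```scala
    import spark.implicits._

    val df1 = Seq(("aaa", "apple")).toDF("aCol", "name")
    val df2 = Seq(("orange", "bbb")).toDF("name", "aCol")  // same columns, different order

    // a positional union would silently mix the columns up;
    // unionByName aligns them by column name before unioning
    val combined = Seq(df1, df2).reduce(_ unionByName _)
    ```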
    
  • 2020-12-14 23:09

    You can first create a sequence and then use toDF to create the DataFrame:

    scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
    dseq: Seq[(String, String, String)] = List()
    
    scala> for ( x <- fruits){
         |  dseq = dseq :+ ("aaa","bbb",x)
         | }
    
    scala> dseq
    res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
    
    scala> val df = dseq.toDF("aCol","bCol","name")
    df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
    
    scala> df.show
    +----+----+------+
    |aCol|bCol|  name|
    +----+----+------+
    | aaa| bbb| apple|
    | aaa| bbb|orange|
    | aaa| bbb| melon|
    +----+----+------+
    
  • 2020-12-14 23:09

    Well... I think your question is a bit misguided.

    As far as I understand what you are trying to do, you should do the following:

    val fruits = List(
      "apple",
      "orange",
      "melon"
    )
    
    val df = fruits
      .map(x => ("aaa", "bbb", x))
      .toDF("aCol", "bCol", "name")
    

    And this should be sufficient.

  • 2020-12-14 23:11

    You could create a sequence of DataFrames and then use reduce:

    val results = fruits.
      map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
      reduce(_.union(_))
    
    results.show()
    