How can I create a Spark DataFrame from a nested array of struct element?

北恋 2020-12-24 09:21

I have read a JSON file into Spark. This file has the following structure:

scala> tweetBlob.printSchema
root
 |-- related: struct (nullable = true)
 |    ...


        
3 Answers
  •  攒了一身酷
    2020-12-24 10:04

    One possible way to handle this is to extract the required information from the schema. Let's start with some dummy data:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._
    
    
    case class Bar(x: Int, y: String)
    case class Foo(bar: Bar)
    
    val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
    
    df.printSchema
    
    // root
    //  |-- bar: struct (nullable = true)
    //  |    |-- x: integer (nullable = false)
    //  |    |-- y: string (nullable = true)
    

    and a helper function:

    // Given a top-level struct column, return its fields as fully
    // qualified Columns (e.g. bar.x, bar.y); empty if not a struct.
    def children(colname: String, df: DataFrame) = {
      val parent = df.schema.fields.filter(_.name == colname).head
      val fields = parent.dataType match {
        case x: StructType => x.fields
        case _ => Array.empty[StructField]
      }
      fields.map(x => col(s"$colname.${x.name}"))
    }
    

    Finally, the result:

    df.select(children("bar", df): _*).printSchema
    
    // root
    // |-- x: integer (nullable = true)
    // |-- y: string (nullable = true)
    
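    As a side note, Spark also supports star expansion directly on struct columns, which should flatten the struct without the helper (a sketch, assuming a Spark session is available and `df` is the DataFrame built above):

    ```scala
    import org.apache.spark.sql.functions.col

    // Star expansion: select every field of the `bar` struct as a
    // top-level column, equivalent to children("bar", df) above.
    df.select(col("bar.*")).printSchema
    ```

    This is convenient when you want all fields; the `children` helper remains useful when you need to inspect or filter the fields programmatically.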
