I have JSON data in the following format:
{ "date": 100 "userId": 1 "data": [ { "timeStamp": 101, "reading": 1 }, { "timeStamp": 102, "reading": 2 } ] } { "date": 200 "userId": 1 "data": [ { "timeStamp": 201, "reading": 3 }, { "timeStamp": 202, "reading": 4 } ] } I read it into Spark SQL:
```scala
val df = sqlContext.read.json(...)

df.printSchema
// root
//  |-- date: double (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- data: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- timeStamp: double (nullable = true)
//  |    |    |-- reading: double (nullable = true)
```

I would like to transform it so that there is one row per reading. As I understand it, every transformation produces a new DataFrame, so the following should work:
```scala
import org.apache.spark.sql.functions.explode

val exploded = df
  .withColumn("reading", explode(df("data.reading")))
  .withColumn("timeStamp", explode(df("data.timeStamp")))
  .drop("data")

exploded.printSchema
// root
//  |-- date: double (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- timeStamp: double (nullable = true)
//  |-- reading: double (nullable = true)
```

The resulting schema is correct, but I get every value twice:
```scala
exploded.show
// +-----------+-----------+-----------+-----------+
// |       date|     userId|  timeStamp|    reading|
// +-----------+-----------+-----------+-----------+
// |        100|          1|        101|          1|
// |        100|          1|        101|          1|
// |        100|          1|        102|          2|
// |        100|          1|        102|          2|
// |        200|          1|        201|          3|
// |        200|          1|        201|          3|
// |        200|          1|        202|          4|
// |        200|          1|        202|          4|
// +-----------+-----------+-----------+-----------+
```

My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand.
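A minimal way to see where the duplication comes in would be to split the chain and check the row count after each explode (`step1`/`step2` are just illustrative names, not part of my actual code):

```scala
import org.apache.spark.sql.functions.explode

// First explode: one generated row per element of data.reading
val step1 = df.withColumn("reading", explode(df("data.reading")))
println(step1.count())   // I would expect 4 rows here (one per reading)

// Second explode, referencing df("data.timeStamp") just like in the code above
val step2 = step1.withColumn("timeStamp", explode(df("data.timeStamp")))
println(step2.count())   // 8 rows, which matches the duplicated output shown above
```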
Is there a way to get the above code to work? Or should I use a different approach altogether?
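By a different approach I mean something along these lines, for example: a single explode over the whole `data` array, followed by selecting the struct fields (the `element` column name is just for illustration):

```scala
import org.apache.spark.sql.functions.{col, explode}

// One explode over the array of structs, then pull the fields out of each struct
val alternative = df
  .withColumn("element", explode(df("data")))
  .select(
    col("date"),
    col("userId"),
    col("element.timeStamp").as("timeStamp"),
    col("element.reading").as("reading")
  )

alternative.printSchema
alternative.show
```

But even if that is the better option, I would still like to understand why the double explode above duplicates every row.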