I am trying to convert a simple DataFrame to a Dataset using the example in the Spark SQL programming guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
This is how you create a Dataset from a case class:
case class Person(name: String, age: Long)
Keep the case class outside of the class that contains the code below; if it is nested inside a class, Spark may not be able to derive an encoder for it.
import spark.implicits._  // needed for toDS() and the Person encoder

val primitiveDS = Seq(1, 2, 3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
// augmentedDS is already a Dataset[Person], so .as[Person] is a no-op here:
augmentedDS.as[Person].show()
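For reference, the linked guide also builds a Dataset from case-class instances directly; a minimal sketch of that, assuming the same Person class and spark session:

import spark.implicits._

// Build a Dataset[Person] straight from case-class instances,
// as shown in the linked programming guide.
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()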
Hope this helped
If you change Int to Long (or BigInt), it works fine:
case class Person(name: String, age: Long)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
Output:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
EDIT:
spark.read.json parses numbers as Long by default, which is the safer choice. You can change the column type afterwards with a cast or a UDF.
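For example, here is a minimal sketch of narrowing age back to Int after the read, assuming the path and spark session from above (PersonInt is a hypothetical name, and Option[Int] guards against the null row):

import org.apache.spark.sql.functions.col

// Hypothetical variant of Person with an Int age; Option covers the null row.
case class PersonInt(name: String, age: Option[Int])

val peopleIntDS = spark.read.json(path)
  .withColumn("age", col("age").cast("int"))  // narrow Long -> Int
  .as[PersonInt]
peopleIntDS.show()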
EDIT2:
To answer your second question: you need to name the columns correctly before the conversion to Person will work:
val primitiveDS = Seq(1, 2, 3).toDS()
val augmentedDS = primitiveDS
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .withColumnRenamed("_1", "name")
  .withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()
Output:
+-----+---+
| name|age|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
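An equivalent, slightly shorter sketch (not from the original answer) names the tuple columns in one call with toDF instead of two withColumnRenamed calls; augmentedDS2 is just an illustrative name:

import spark.implicits._

val augmentedDS2 = Seq(1, 2, 3).toDS()
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .toDF("name", "age")  // name both tuple columns at once
  .as[Person]
augmentedDS2.show()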