How to convert Row of a Scala DataFrame into case class most efficiently?


Question


Once I have a Row in Spark, either from a DataFrame or from Catalyst, I want to convert it to a case class in my code. This can be done by matching:

someRow match {case Row(a:Long,b:String,c:Double) => myCaseClass(a,b,c)}

But it becomes ugly when the row has a huge number of columns, say a dozen Doubles, some Booleans and even the occasional null.

I would like just to be able to -sorry- cast Row to myCaseClass. Is it possible, or have I already got the most economical syntax?


Answer 1:


DataFrame is simply a type alias of Dataset[Row]. Operations on it are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:

val DFtoProcess = spark.sql("SELECT * FROM peoples WHERE name='test'")

At this point, Spark converts your data into a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

// Create an Encoder for the Java class (in this example Person is a Java bean class)
// For a Scala case class you can instead rely on the implicit encoders from spark.implicits._
val personEncoder = Encoders.bean(classOf[Person])

val DStoProcess = DFtoProcess.as[Person](personEncoder)

Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
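If Person were a Scala case class rather than a Java bean, a minimal sketch along the same lines (assuming a SparkSession named spark and a peoples table whose column names match the case class fields) could rely on the implicit product encoder instead of Encoders.bean:

import org.apache.spark.sql.{Dataset, SparkSession}

// Sketch only: assumes a SparkSession named `spark` and a table `peoples`
// whose column names and types match the case class fields.
case class Person(name: String, age: Long)

import spark.implicits._                      // brings the implicit Encoder[Person] into scope
val dfToProcess = spark.sql("SELECT * FROM peoples WHERE name = 'test'")
val dsToProcess: Dataset[Person] = dfToProcess.as[Person]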

Please refer to the link below, provided by Databricks, for further details:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html




Answer 2:


As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like

map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))

I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
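If you prefer to look fields up by name rather than by position, a minimal sketch in the same spirit (the column names "a", "b", "c" and the case class are assumptions for illustration) could use getAs:

// Sketch only: name-based access; assumes `import spark.implicits._` is in scope
// so Spark can derive an Encoder for the case class.
case class MyCaseClass(a: Long, b: String, c: Double)

val ds = df.map(row => MyCaseClass(
  row.getAs[Long]("a"),        // assumed column names
  row.getAs[String]("b"),
  row.getAs[Double]("c")
))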




Answer 3:


scala> import spark.implicits._    
scala> val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> case class Student(id: Int, name: String)
defined class Student

scala> df.as[Student].collectAsList
res6: java.util.List[Student] = [Student(1,james), Student(2,tony)]

Here the spark in spark.implicits._ is your SparkSession. If you are inside the REPL the session is already defined as spark; otherwise you need to adjust the name to match your SparkSession.
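Outside the REPL, a minimal sketch of a standalone application (the app name and local master are placeholders, not part of the answer above) would build the session itself:

import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an Encoder for it.
case class Student(id: Int, name: String)

object StudentApp {
  def main(args: Array[String]): Unit = {
    // "example" and local[*] are placeholders for your own app name / master.
    val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
    import spark.implicits._                  // `spark` is the session built above

    val ds = Seq((1, "james"), (2, "tony")).toDF("id", "name").as[Student]
    ds.show()

    spark.stop()
  }
}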




Answer 4:


Of course you can match a Row object against a case class. Suppose your schema has many fields and you want to map only a few of them into your case class. If you don't have null fields you can simply do:

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)

dataframe.map {
  case Row(a: java.math.BigDecimal, 
    b: String, 
    c: Int, 
    d: String,                 // bind the field so it can be used in the constructor below
    _: java.sql.Date, 
    e: java.sql.Date,
    _: java.sql.Timestamp, 
    _: java.sql.Timestamp, 
    _: java.math.BigDecimal, 
    _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d, e = e.toString)
}

This approach will fail in case of null values and also requires you to explicitly define the type of every single field. If you have to handle null values, you can either discard all the rows containing null values by doing

dataframe.na.drop()

That will drop records even if the null fields are not the ones used in the pattern matching for your case class. Alternatively, if you want to handle nulls, you can turn the Row object into a List and then use the Option pattern:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)

dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, 
    b: String, 
    c: Int, 
    d: String,                 // bind the field so it can be used in the constructor below
    _: java.sql.Date, 
    e: java.sql.Date,
    _: java.sql.Timestamp, 
    _: java.sql.Timestamp, 
    _: java.math.BigDecimal, 
    _: String) => MyClass(
      a = a.longValue(), b = b, c = Option(c), d = d, e = e.toString)
})

Check out the GitHub project Sparkz, which will soon introduce a lot of libraries for simplifying the Spark and DataFrame APIs and making them more functional-programming oriented.



Source: https://stackoverflow.com/questions/28166555/how-to-convert-row-of-a-scala-dataframe-into-case-class-most-efficiently
