How to create a Dataset from custom class Person?

Anonymous (unverified), posted 2019-12-03 07:36:14

Question:

I was trying to create a Dataset in Java, so I wrote the following code:

    public Dataset<Person> createDataset() {
        List<Person> list = new ArrayList<>();
        list.add(new Person("name", 10, 10.0));
        // Build the Dataset with a bean encoder for Person
        Dataset<Person> dataset = sqlContext.createDataset(list, Encoders.bean(Person.class));
        return dataset;
    }

The Person class is an inner class.

Spark, however, throws the following exception:

    org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `....` without access to the scope that this class was defined in. Try moving this class out of its parent class.;
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:264)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:260)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)

How do I do this properly?

Answer 1:

tl;dr (Only in Spark shell) Define your case classes first and, once they are defined, use them. Using case classes in Spark/Scala applications should just work.

In Spark 2.0.1, in the Spark shell, you should define case classes first and only then use them to create a Dataset.

    $ ./bin/spark-shell --version
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
          /_/

    Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
    Branch master
    Compiled by user jacek on 2016-10-25T04:20:04Z
    Revision 483c37c581fedc64b218e294ecde1a7bb4b2af9c
    Url https://github.com/apache/spark.git
    Type --help for more information.

    $ ./bin/spark-shell
    scala> :pa
    // Entering paste mode (ctrl-D to finish)

    case class Person(id: Long)

    Seq(Person(0)).toDS // <-- this won't work

    // Exiting paste mode, now interpreting.

    <console>:15: error: value toDS is not a member of Seq[Person]
           Seq(Person(0)).toDS // <-- it won't work
                          ^

    scala> case class Person(id: Long)
    defined class Person

    scala> // the following implicit conversion *will* work

    scala> Seq(Person(0)).toDS
    res1: org.apache.spark.sql.Dataset[Person] = [id: bigint]

Commit 43ebf7a9cbd70d6af75e140a6fc91bf0ffc2b877 (Spark 2.0.0-SNAPSHOT as of March 21st) added a way to work around the issue.

In the Scala REPL I had to add OuterScopes.addOuterScope(this) and enter the complete snippet with :paste, as follows:

    scala> :pa
    // Entering paste mode (ctrl-D to finish)

    import sqlContext.implicits._
    case class Token(name: String, productId: Int, score: Double)
    val data = Token("aaa", 100, 0.12) ::
      Token("aaa", 200, 0.29) ::
      Token("bbb", 200, 0.53) ::
      Token("bbb", 300, 0.42) :: Nil
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
    val ds = data.toDS


Answer 2:

The solution was to add this piece of code at the start of the method:

org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this); 
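As background on why the outer scope matters: a non-static inner class in Java can only be instantiated through an instance of its enclosing class, so anything that builds objects reflectively needs access to that instance. The sketch below uses plain Java reflection with no Spark involved. OuterScopeDemo, InnerPerson, NestedPerson and the two helper methods are hypothetical names made up for illustration. It shows the hidden constructor parameter an inner class carries, and is not a reproduction of Spark's own encoder logic.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;

// Hypothetical demo class; none of these names come from Spark.
class OuterScopeDemo {

    // Non-static inner class: every instance is tied to an OuterScopeDemo instance.
    class InnerPerson {
        InnerPerson() {}
    }

    // Static nested class: can be created without any enclosing instance.
    static class NestedPerson {
        NestedPerson() {}
    }

    // Number of parameters the (single) declared constructor really takes,
    // including the implicit enclosing-instance parameter of an inner class.
    static int constructorParamCount(Class<?> cls) {
        Constructor<?> ctor = cls.getDeclaredConstructors()[0];
        return ctor.getParameterCount();
    }

    // True when cls is a non-static member class, i.e. it needs an outer instance.
    static boolean needsOuterInstance(Class<?> cls) {
        return cls.isMemberClass() && !Modifier.isStatic(cls.getModifiers());
    }

    public static void main(String[] args) {
        System.out.println("InnerPerson needs outer instance: "
                + needsOuterInstance(InnerPerson.class));
        System.out.println("InnerPerson constructor parameters: "
                + constructorParamCount(InnerPerson.class));
        System.out.println("NestedPerson needs outer instance: "
                + needsOuterInstance(NestedPerson.class));
        System.out.println("NestedPerson constructor parameters: "
                + constructorParamCount(NestedPerson.class));
    }
}
```

The no-argument constructor of InnerPerson actually takes one parameter, the enclosing OuterScopeDemo. Presumably this is the gap that registering the outer scope fills.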


Answer 3:

For a similar issue in Scala, my solution was to do exactly what the AnalysisException suggested: move the case class out of its parent class. For example, I had something like the following in Streaming_Base.scala:

    abstract class Streaming_Base {
        case class EventBean(id:String, command:String, recordType:String)
        ...
    }

I changed that to the following:

    case class EventBean(id:String, command:String, recordType:String)

    abstract class Streaming_Base {
        ...
    }
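For the Java code in the question, the equivalent move would be to make Person a top-level or static nested class written as a regular JavaBean. The following is a minimal sketch under assumptions: the field names name, age and score are guesses based on the constructor call Person("name", 10, 10.0), and PersonBeanDemo and readWriteProperties are hypothetical. The helper uses the standard java.beans.Introspector only as a stand-in to check that the class looks like a bean with readable and writable properties. Treat it as an approximation of what a bean encoder expects, not as Spark's actual code path.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder class; in a real project Person could simply live in Person.java.
class PersonBeanDemo {

    // Static nested bean: no enclosing instance is needed to create it.
    // Field names are assumptions; the original post does not show them.
    public static class Person implements Serializable {
        private String name;
        private int age;
        private double score;

        public Person() {}  // no-argument constructor, as JavaBeans conventionally provide

        public Person(String name, int age, double score) {
            this.name = name;
            this.age = age;
            this.score = score;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
        public double getScore() { return score; }
        public void setScore(double score) { this.score = score; }
    }

    // Lists the properties that have both a getter and a setter.
    static List<String> readWriteProperties(Class<?> beanClass) {
        try {
            // Stop at Object so the built-in "class" property is not reported.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getReadMethod() != null && pd.getWriteMethod() != null) {
                    names.add(pd.getName());
                }
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWriteProperties(Person.class));
    }
}
```

With a bean shaped like this, the call from the question, sqlContext.createDataset(list, Encoders.bean(Person.class)), should no longer depend on an outer instance.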

