apache-spark-dataset

Encode an ADT / sealed trait hierarchy into a Spark Dataset column

Submitted by 懵懂的女人 on 2019-11-26 14:46:04
Question: If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) within a Spark Dataset column, what is the best encoding strategy? For example, if I have an ADT where the leaf types store different kinds of data:

    sealed trait Occupation
    case object SoftwareEngineer extends Occupation
    case class Wizard(level: Int) extends Occupation
    case class Other(description: String) extends Occupation

What's the best way to construct a:

    org.apache.spark.sql.Dataset[Occupation]

Answer 1: TL;DR
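
The answer's TL;DR is truncated above. As a hedged illustration, one commonly cited fallback for an ADT column is a binary Kryo encoder; everything beyond the trait hierarchy itself (the object name, master setting, and sample data) is my own scaffolding, not from the source:

    import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

    sealed trait Occupation
    case object SoftwareEngineer extends Occupation
    case class Wizard(level: Int) extends Occupation
    case class Other(description: String) extends Occupation

    object AdtEncodingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("adt").getOrCreate()
        // Kryo accepts arbitrary classes, at the cost of structure:
        // the whole value is stored as one opaque binary column.
        implicit val occupationEncoder: Encoder[Occupation] = Encoders.kryo[Occupation]
        val ds: Dataset[Occupation] =
          spark.createDataset(Seq(SoftwareEngineer, Wizard(9), Other("lumberjack")))
        ds.show()
        spark.stop()
      }
    }

The trade-off is that the column is no longer queryable by field, so filters and aggregations on the payload must go through typed map/filter operations.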

Why is “Unable to find encoder for type stored in a Dataset” raised when creating a dataset of a custom case class?

Submitted by 风流意气都作罢 on 2019-11-26 14:34:11
Spark 2.0 (final) with Scala 2.11.8. The following super-simple code yields the compilation error:

    Error:(17, 45) Unable to find encoder for type stored in a Dataset.
    Primitive types (Int, String, etc) and Product types (case classes) are
    supported by importing spark.implicits._ Support for serializing other
    types will be added in future releases.

    import org.apache.spark.sql.SparkSession

    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc"),
        SimpleTuple(6, "bcd")
      )

      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
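
The snippet is cut off above, but the error message itself points at the usual fix. A minimal sketch of a compiling version, assuming the intent was simply to build a Dataset from the list (the master/appName settings and show() call are my additions):

    import org.apache.spark.sql.SparkSession

    // Declared at top level so Spark can derive an encoder for it.
    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc"),
        SimpleTuple(6, "bcd")
      )

      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession.builder()
          .master("local[*]").appName("DatasetTest").getOrCreate()
        import sparkSession.implicits._ // brings Encoder[SimpleTuple] into scope
        val ds = dataList.toDS()
        ds.show()
        sparkSession.stop()
      }
    }

The two classic causes of this error are a missing import of the session's implicits and a case class defined inside the scope that uses it; the sketch avoids both.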

Perform a typed join in Scala with Spark Datasets

Submitted by 社会主义新天地 on 2019-11-26 12:59:10
Question: I like Spark Datasets as they give me analysis errors and syntax errors at compile time, and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with the Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset's typed objects than by using RDD rows' data fields. However, the join operation is missing from this. I read that I can do a join like this: ds1
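
The question is truncated at the join expression, but one typed option in this area is Dataset.joinWith, which returns a Dataset of pairs instead of a flattened DataFrame. A minimal sketch with made-up case classes (Person and Address are my placeholders):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person(id: Long, name: String)
    case class Address(personId: Long, city: String)

    object TypedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("join").getOrCreate()
        import spark.implicits._
        val people: Dataset[Person] = Seq(Person(1, "Ada"), Person(2, "Bob")).toDS()
        val addresses: Dataset[Address] = Seq(Address(1, "London")).toDS()
        // joinWith keeps both sides typed: the result is Dataset[(Person, Address)].
        val joined: Dataset[(Person, Address)] =
          people.joinWith(addresses, people("id") === addresses("personId"), "inner")
        joined.show()
        spark.stop()
      }
    }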

Difference between DataFrame, Dataset, and RDD in Spark

Submitted by 南楼画角 on 2019-11-26 04:56:50
Question: I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark? Can you convert one to the other?

Answer 1: A DataFrame is well defined by a Google search for "DataFrame definition": a data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional metadata due to its tabular format,
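
Conversions between the three are direct, which is worth showing alongside the definitions. A minimal round-trip sketch (Record and the sample rows are my own illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class Record(key: Int, value: String)

    object ConversionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("conv").getOrCreate()
        import spark.implicits._
        val rdd: RDD[Record] =
          spark.sparkContext.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        val df: DataFrame = rdd.toDF()          // RDD -> DataFrame (alias for Dataset[Row])
        val ds: Dataset[Record] = df.as[Record] // DataFrame -> typed Dataset
        val back: RDD[Record] = ds.rdd          // Dataset -> RDD
        println(back.count())
        spark.stop()
      }
    }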

DataFrame / Dataset groupBy behaviour/optimization

Submitted by 邮差的信 on 2019-11-25 23:49:25
Question: Suppose we have a DataFrame df consisting of the following columns: Name, Surname, Size, Width, Length, Weight. Now we want to perform a couple of operations, for example we want to create a couple of DataFrames containing data about Size and Width.

    val df1 = df.groupBy("surname").agg(sum("size"))
    val df2 = df.groupBy("surname").agg(sum("width"))

As you can notice, other columns, like Length, are not used anywhere. Is Spark smart enough to drop the redundant columns before the
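
The entry is cut off, but the standard way to answer this for a concrete query is to read the physical plan. A minimal sketch, with column data invented for illustration, that makes Catalyst's column pruning visible:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object GroupBySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("groupBy").getOrCreate()
        import spark.implicits._
        val df = Seq(("Ann", "Lee", 1, 2, 3, 4), ("Bo", "Kim", 5, 6, 7, 8))
          .toDF("name", "surname", "size", "width", "length", "weight")
        val df1 = df.groupBy("surname").agg(sum("size"))
        // If pruning kicks in, the scan in the printed plan reads only
        // `surname` and `size`; the other columns never leave the source.
        df1.explain()
        spark.stop()
      }
    }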

How to store custom objects in Dataset?

Submitted by 丶灬走出姿态 on 2019-11-25 22:30:49
Question: According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.

And attempts to store a custom type in a Dataset lead to an error like the following:

    Unable to find encoder for type stored in a Dataset. Primitive types
    (Int, String, etc) and Product types (case classes) are supported by
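
Pending that custom-encoder API, the workaround most often quoted is the same binary fallback as in the ADT entry above: Encoders.kryo (or Encoders.javaSerialization). A minimal sketch, with Bar as my stand-in for an arbitrary non-case class:

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    // A plain class: Spark cannot derive an encoder for it automatically.
    class Bar(val field: Int)

    object CustomObjectSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("custom").getOrCreate()
        // Serialize the object as a single opaque binary column.
        implicit val barEncoder: Encoder[Bar] = Encoders.kryo[Bar]
        val ds = spark.createDataset(Seq(new Bar(1), new Bar(2)))
        ds.show()
        spark.stop()
      }
    }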

Encoder error while trying to map dataframe row to updated row

Submitted by 扶醉桌前 on 2019-11-25 21:43:52
When I'm trying to do the same thing in my code as mentioned below:

    dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })

I have taken the above reference from here: Scala: How can I replace value in Dataframes using Scala. But I am getting an encoder error:

    Unable to find encoder for type stored in a Dataset. Primitive types
    (Int, String, etc) and Product types (case classes) are supported by
    importing spark.implicits._ Support for serializing other types will
    be added in future releases.

Note: I am using
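
The entry is truncated above. The usual fix for this particular error is to stop producing untyped Row values in the map and return a case class instead, so the session's implicits can supply an encoder. A minimal self-contained sketch (Car and the sample data are my inventions; the mapping logic mirrors the snippet above):

    import org.apache.spark.sql.SparkSession

    case class Car(id: String, make: String, model: String)

    object EncoderFixSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("fix").getOrCreate()
        import spark.implicits._
        val dataframe = Seq(("1", "Tesla", "X"), ("2", "Ford", "F"))
          .toDF("id", "make", "model")
        val fixed = dataframe.map { row =>
          val row1 = row.getAs[String](1)
          val make = if (row1.toLowerCase == "tesla") "S" else row1
          Car(row.getString(0), make, row.getString(2)) // typed result, encoder derived
        }
        fixed.show()
        spark.stop()
      }
    }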