apache-spark-dataset

How to add a Map column to Spark dataset?

做~自己de王妃 submitted on 2021-02-08 09:15:43
Question: I have a Java Map variable, say Map<String, String> singleColMap. I want to add this Map variable to a dataset as a new column value in Spark 2.2 (Java 1.8). I tried the code below, but it is not working:

ds.withColumn("cMap", lit(singleColMap).cast(MapType(StringType, StringType)))

Can someone help with this?

Answer 1: You can use typedLit, which was introduced in Spark 2.2.0. From the documentation: "The difference between this function and lit is that this function can handle parameterized Scala types e.g.: List, Seq and Map."
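A minimal Scala sketch of the typedLit approach the answer points to, assuming a SparkSession named spark and the Dataset bound to ds; the map contents are placeholders:

import org.apache.spark.sql.functions.typedLit

// typedLit captures the Scala type, so the resulting column is map<string,string>
val singleColMap = Map("k1" -> "v1", "k2" -> "v2")
val withMap = ds.withColumn("cMap", typedLit(singleColMap))
withMap.printSchema() // cMap: map<string,string>

From Java, where typedLit's Scala TypeTag is awkward to supply, a map column can often be built instead with functions.map(lit(key), lit(value), ...).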

Spark Dataset : data transformation

旧城冷巷雨未停 submitted on 2021-02-07 10:19:06
Question: I have a Spark Dataset of the format:

+--------------+--------+-----+
|name          |type    |cost |
+--------------+--------+-----+
|AAAAAAAAAAAAAA|XXXXX   |0.24 |
|AAAAAAAAAAAAAA|YYYYY   |1.14 |
|BBBBBBBBBBBBBB|XXXXX   |0.78 |
|BBBBBBBBBBBBBB|YYYYY   |2.67 |
|BBBBBBBBBBBBBB|ZZZZZ   |0.15 |
|CCCCCCCCCCCCCC|XXXXX   |1.86 |
|CCCCCCCCCCCCCC|YYYYY   |1.50 |
|CCCCCCCCCCCCCC|ZZZZZ   |1.00 |
+--------------+--------+-----+

I want to transform this into an object of type:

public class CostPerName { private String name; private Map …
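A hedged Scala sketch of one way to collapse the (type, cost) pairs into a per-name map before mapping rows into CostPerName. It assumes the Dataset is bound to ds, that Spark 2.4+ is available for map_from_entries, and the output column name costPerType is only illustrative (the class definition above is truncated):

import org.apache.spark.sql.functions.{col, collect_list, map_from_entries, struct}

// group the (type, cost) pairs per name and fold them into a single map column
val perName = ds
  .groupBy("name")
  .agg(map_from_entries(collect_list(struct(col("type"), col("cost")))).as("costPerType"))
perName.show(false)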

Spark : Create dataframe with default values

ⅰ亾dé卋堺 submitted on 2021-01-29 11:22:04
Question: Can we set a default value for a field of a dataframe while creating the dataframe? I am creating a Spark dataframe from List<Object[]> rows as:

List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);

While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains an object of the org.apache.spark.sql.types.Metadata class. The documentation …
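The excerpt is cut off, but one common alternative to per-field metadata is to fill defaults right after the dataframe is built; a small Scala sketch, where the column names and default values are invented for illustration:

// na.fill replaces nulls with per-column defaults; "name" and "age" are hypothetical columns
val withDefaults = dataset.na.fill(Map("name" -> "UNKNOWN", "age" -> 0))
withDefaults.show()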

How to store nested custom objects in Spark Dataset?

a 夏天 submitted on 2021-01-29 07:48:20
Question: This question is a follow-up of "How to store custom objects in Dataset?". Spark version: 3.0.1. A non-nested custom type is achievable:

import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}

class AnObj(val a: Int, val b: String)

implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
val d = spark.createDataset(Seq(new AnObj(1, "a")))
d.printSchema

root
 |-- value: binary (nullable = true)

However, if the custom type is nested inside a product type (i.e. a case class), it …
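One workaround sketch in Scala, on the assumption that encoding the whole wrapping product type with Kryo is acceptable (this loses the columnar schema, just like the non-nested Kryo example above; Wrapper is a hypothetical case class):

import org.apache.spark.sql.{Encoder, Encoders}

class AnObj(val a: Int, val b: String)
case class Wrapper(id: Long, obj: AnObj) // hypothetical product type holding the custom object

// encode the whole wrapper with Kryo instead of relying on the implicit product encoder
implicit val wrapperEncoder: Encoder[Wrapper] = Encoders.kryo[Wrapper]
val d = spark.createDataset(Seq(Wrapper(1L, new AnObj(1, "a"))))
d.printSchema() // root |-- value: binary (nullable = true)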

How to Split rows to different columns in Spark DataFrame/DataSet?

老子叫甜甜 submitted on 2021-01-28 02:18:31
Question: Suppose I have a data set like:

Name | Subject | Y1   | Y2
A    | math    | 1998 | 2000
B    |         | 1996 | 1999
     | science | 2004 | 2005

I want to split the rows of this data set so that the Y2 column is eliminated, like:

Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005

Can someone suggest something here? I hope I have made my query clear. Thanks in advance.

Answer 1: I think you only need to create a UDF to create the …
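The answer above is cut off; as an alternative sketch in Scala, the range Y1..Y2 can also be expanded without a UDF by generating the year sequence and exploding it. This assumes Spark 2.4+ for sequence, that Y1/Y2 are integer columns, and that the data is bound to a DataFrame named df:

import org.apache.spark.sql.functions.{col, explode, sequence}

// one output row per year in [Y1, Y2]; the Y2 column is then dropped
val expanded = df
  .withColumn("Y1", explode(sequence(col("Y1"), col("Y2"))))
  .drop("Y2")
expanded.show()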

Spark Error: Unable to find encoder for type stored in a Dataset

China☆狼群 submitted on 2021-01-27 07:50:22
Question: I am using Spark in a Zeppelin notebook, and groupByKey() does not seem to be working. This code:

df.groupByKey(row => row.getLong(0))
  .mapGroups((key, iterable) => println(key))

gives me this error (presumably a compilation error, since it shows up almost instantly while the dataset I am working on is pretty big):

error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
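A small Scala sketch of the usual fix, assuming the goal is just to get the keys out: import spark.implicits._ in the same cell and return a value that has an encoder, because println returns Unit and there is no encoder for Unit:

import spark.implicits._ // provides encoders for primitives and case classes

val keys = df
  .groupByKey(row => row.getLong(0))
  .mapGroups((key, rows) => key) // Long is encodable via spark.implicits._
keys.show()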

Spark: Dataset Serialization

落花浮王杯 submitted on 2020-12-28 23:50:08
Question: If I have a dataset in which each record is a case class, and I persist that dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

does Spark use Java/Kryo serialization to serialize the dataset? Or, as with a dataframe, does Spark have its own way of storing the data in the dataset?

Answer 1: A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects …
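A short Scala sketch illustrating the answer's point: the cached data goes through the Dataset's Encoder (Spark's internal binary format), not Java or Kryo serialization. The case class, the SparkSession named spark, and the storage level are only for illustration, with the constant name corrected to MEMORY_ONLY_SER:

import org.apache.spark.storage.StorageLevel
import spark.implicits._

case class Record(id: Long, name: String) // hypothetical record type
val myDS = spark.createDataset(Seq(Record(1L, "a"), Record(2L, "b")))

// persisting a Dataset stores the encoder-produced binary rows, whichever level is chosen
myDS.persist(StorageLevel.MEMORY_ONLY_SER)
myDS.count() // materializes the cache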
