apache-spark-dataset

Convert scala list to DataFrame or DataSet

夙愿已清 submitted on 2020-01-02 03:00:15
Question: I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculated data on a source DataFrame) to a DataFrame or Dataset. I cannot find any direct method to do that. I have tried the following approaches to convert my list to a Dataset, but none of them seem to work. I am providing the three situations below. Can someone please give me some ray of hope on how to do this conversion? Thanks. import org.apache.spark.sql.{DataFrame, Row, SQLContext,
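A minimal sketch of the usual approach, assuming a local SparkSession and an illustrative `Result` case class: a list of case class instances converts with `toDS()`, and a list of tuples converts with `toDF()` once `spark.implicits._` is in scope.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class holding the computed results; it is defined at the top
// level so Spark can derive an implicit Encoder for it.
case class Result(key: String, score: Double)

object ListToDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list-to-ds").getOrCreate()
    import spark.implicits._

    // A list of case class instances converts directly to a typed Dataset ...
    val results = List(Result("a", 1.0), Result("b", 2.5))
    val ds = results.toDS()

    // ... while a list of tuples converts to a DataFrame with named columns.
    val df = List(("a", 1.0), ("b", 2.5)).toDF("key", "score")

    ds.show()
    df.show()
    spark.stop()
  }
}
```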

Spark Streaming: Reading data from Kafka that has multiple schemas

五迷三道 submitted on 2020-01-01 06:32:33
Question: I am struggling with the implementation in Spark Streaming. The messages from Kafka look like the following, but with more fields:
{"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
{"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
{"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
{"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}
I am trying to read the messages from a Kafka topic
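One way to handle a shared envelope like this is to extract the common fields, keep the payload as a raw JSON string, and route by event type so each branch can parse its payload with its own schema. A sketch assuming Structured Streaming with the spark-sql-kafka connector; the broker address and topic name are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

object MultiSchemaKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-schema-kafka").getOrCreate()
    import spark.implicits._

    // Placeholder broker and topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Pull out the envelope fields; the payload stays a JSON string for now.
    val parsed = raw
      .select($"value".cast("string").as("json"))
      .select(
        get_json_object($"json", "$.event").as("event"),
        get_json_object($"json", "$.source").as("source"),
        get_json_object($"json", "$.payload").as("payload"))

    // Route by event type; each branch can apply its own payload schema with from_json.
    val sensorData = parsed.filter($"event" === "sensordata")

    sensorData.writeStream.format("console").start().awaitTermination()
  }
}
```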

How to create a custom Encoder in Spark 2.X Datasets?

筅森魡賤 submitted on 2019-12-31 12:21:05
Question: Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a template for our own implementations. Here is an example of code that works in Spark 1.X / DataFrames but does not compile under the new regime: // mapping each row to an RDD tuple df.map(row => { var id: String = if (!has_id) "" else row.getAs[String]("id"
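For a class Catalyst cannot derive an encoder for, the usual route is not to subclass Encoder directly but to ask the Encoders factory for a Kryo-backed (or Java-serialization-backed) encoder and put it in implicit scope. A sketch with a hypothetical `MyRecord` class; the data is stored as a single binary column rather than typed columns.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Illustrative custom type that is not a case class / Product.
class MyRecord(val id: String, val score: Double) extends Serializable

object CustomEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("custom-encoder").getOrCreate()

    // Kryo-based encoder for MyRecord; Encoders.javaSerialization is the alternative.
    implicit val myRecordEncoder: Encoder[MyRecord] = Encoders.kryo[MyRecord]

    import spark.implicits._
    val ds = spark.createDataset(Seq(new MyRecord("a", 1.0), new MyRecord("b", 2.0)))

    // Map back to a primitive (String) to get a readable, typed column again.
    ds.map(_.id).show()

    spark.stop()
  }
}
```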

Mapping json to case class with Spark (spaces in the field name)

ⅰ亾dé卋堺 submitted on 2019-12-30 11:02:39
Question: I am trying to read a JSON file with the Spark Dataset API; the problem is that this JSON contains spaces in some of the field names. This would be a JSON row: {"Field Name" : "value"} My case class needs to look like this: case class MyType(`Field Name`: String) I can load the file into a DataFrame and it will infer the correct schema: val dataframe = spark.read.json(path) The problem comes when I try to convert the DataFrame to a Dataset[MyType] with dataframe.as[MyType]. The StructSchema loaded
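The backticked case class from the question compiles fine; the trouble starts at `dataframe.as[MyType]`. A sketch under an illustrative input path that reproduces the setup and shows one common workaround: rename the column to something without spaces and map to a case class with an ordinary field name (the `Renamed` class here is hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Backticks let a Scala case class field carry a space, matching the JSON key directly.
case class MyType(`Field Name`: String)

// Hypothetical companion case class for the rename-based workaround below.
case class Renamed(fieldName: String)

object SpacesInFieldNamesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("spaces-in-fields").getOrCreate()
    import spark.implicits._

    // Placeholder path; the JSON rows look like {"Field Name": "value"}.
    val dataframe = spark.read.json("/path/to/input.json")
    dataframe.printSchema() // root |-- Field Name: string (nullable = true)

    // Rename the offending column, then convert to a typed Dataset.
    val ds = dataframe.withColumnRenamed("Field Name", "fieldName").as[Renamed]
    ds.show()

    spark.stop()
  }
}
```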

Array Intersection in Spark SQL

旧巷老猫 submitted on 2019-12-29 08:03:10
Question: I have a table with an array-type column named writer, which has values like array[value1, value2], array[value2, value3], etc. I am doing a self join to get the rows whose arrays have values in common. I tried: sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ") and sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is
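Spark 2.4 and later ship a built-in `array_intersect` SQL function; on earlier versions the same check typically needs a UDF over `Seq[String]`. A sketch assuming Spark 2.4+, with an illustrative `writers` view and sample rows standing in for the question's table.

```scala
import org.apache.spark.sql.SparkSession

object ArrayIntersectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("array-intersection").getOrCreate()
    import spark.implicits._

    // Sample data shaped like the question: an id and an array column named writer.
    val writers = Seq(
      (1, Seq("value1", "value2")),
      (2, Seq("value2", "value3")),
      (3, Seq("value4"))
    ).toDF("id", "writer")
    writers.createOrReplaceTempView("writers")

    // Self-join keeping only pairs whose writer arrays share at least one element.
    val common = spark.sql(
      """SELECT R2.writer
        |FROM writers R1 JOIN writers R2 ON R1.id != R2.id
        |WHERE size(array_intersect(R1.writer, R2.writer)) > 0""".stripMargin)

    common.show()
    spark.stop()
  }
}
```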

Why do columns change to nullable in Apache Spark SQL?

依然范特西╮ submitted on 2019-12-28 06:47:13
Question: Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame? val myDf = Seq((2,"A"),(2,"B"),(1,"C")).toDF("foo","bar").withColumn("foo", 'foo.cast("Int")) myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show When df.printSchema is called now, nullable will be false for both columns. val foo: (Int => String) = (t: Int) => { fooMap.get(t) match { case Some(tt) => tt case None => "notFound" } } val
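A sketch reproducing the observation: an Int column built from tuples starts out with nullable = false, but a column produced by a Scala UDF is marked nullable = true because Spark cannot prove the function never returns null. The contents of `fooMap` here are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullability").getOrCreate()
    import spark.implicits._

    val myDf = Seq((2, "A"), (2, "B"), (1, "C")).toDF("foo", "bar")
    myDf.printSchema() // foo: integer (nullable = false), bar: string (nullable = true)

    // Illustrative lookup map and the UDF from the question.
    val fooMap = Map(1 -> "one", 2 -> "two")
    val foo: Int => String = t => fooMap.get(t) match {
      case Some(tt) => tt
      case None     => "notFound"
    }
    val fooUdf = udf(foo)

    // The derived column is nullable = true even though the UDF never returns null.
    myDf.withColumn("foo_name", fooUdf($"foo")).printSchema()
    spark.stop()
  }
}
```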

SparkSQL Aggregator: MissingRequirementError

浪子不回头ぞ submitted on 2019-12-25 07:14:08
Question: I am trying to use Apache Spark 2.0's Datasets: import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.Encoder import spark.implicits._ case class C1(f1: String, f2: String, f3: String, f4: String, f5: Double) val teams = Seq( C1("hash1", "NLC", "Cubs", "2016-01-23", 3253.21), C1("hash1", "NLC", "Cubs", "2014-01-23", 353.88), C1("hash3", "NLW", "Dodgers", "2013-08-15", 4322.12), C1("hash4", "NLE", "Red Sox", "2010-03-14", 10283.72) ).toDS() val c1Agg = new Aggregator
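A sketch of a complete typed Aggregator over the question's C1 rows, summing f5 per f1 key; the aggregation itself is illustrative. Defining the case class and the Aggregator at the top level, rather than inside a method or notebook cell, is a common way to sidestep the reflection problems behind MissingRequirementError.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class C1(f1: String, f2: String, f3: String, f4: String, f5: Double)

// Typed aggregator: input C1, buffer Double, output Double.
object SumF5 extends Aggregator[C1, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, row: C1): Double = acc + row.f5
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("aggregator").getOrCreate()
    import spark.implicits._

    val teams = Seq(
      C1("hash1", "NLC", "Cubs", "2016-01-23", 3253.21),
      C1("hash1", "NLC", "Cubs", "2014-01-23", 353.88),
      C1("hash3", "NLW", "Dodgers", "2013-08-15", 4322.12),
      C1("hash4", "NLE", "Red Sox", "2010-03-14", 10283.72)
    ).toDS()

    // Group by the f1 key and apply the aggregator as a typed column.
    teams.groupByKey(_.f1).agg(SumF5.toColumn.name("total_f5")).show()
    spark.stop()
  }
}
```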

Spark Encoders: when to use beans()

巧了我就是萌 submitted on 2019-12-24 11:16:13
Question: I came across a memory management problem while using Spark's caching mechanism. I am currently using Encoders with Kryo and was wondering whether switching to beans would help me reduce the size of my cached dataset. Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with the SER option? For the record, I have found a similar topic that
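A sketch contrasting the two explicit encoders the question mentions: Encoders.bean keeps a columnar layout Spark can inspect (fields become real columns), while Encoders.kryo stores each object as one opaque binary column. The `SensorReading` bean and the storage-level call are illustrative, not taken from the question.

```scala
import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.storage.StorageLevel

// Illustrative JavaBean: bean encoders need getters/setters and a no-arg constructor.
class SensorReading(@BeanProperty var id: String, @BeanProperty var reading: Double)
  extends Serializable {
  def this() = this(null, 0.0)
}

object BeanVsKryoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bean-vs-kryo").getOrCreate()

    val beanEncoder: Encoder[SensorReading] = Encoders.bean(classOf[SensorReading])
    val kryoEncoder: Encoder[SensorReading] = Encoders.kryo(classOf[SensorReading])

    val data = Seq(new SensorReading("a", 1.0), new SensorReading("b", 2.0))

    // Bean encoding exposes real columns (id, reading), which Spark can cache compactly.
    val beanDs = spark.createDataset(data)(beanEncoder)
    beanDs.persist(StorageLevel.MEMORY_ONLY_SER).count()
    beanDs.printSchema()

    // Kryo encoding produces a single binary column that Spark cannot inspect or prune.
    val kryoDs = spark.createDataset(data)(kryoEncoder)
    kryoDs.printSchema()

    spark.stop()
  }
}
```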

SortedMap non serializable error in Spark Dataset

我与影子孤独终老i submitted on 2019-12-24 10:47:34
Question: It seems like scala.collection.SortedMap is not serializable? A simple code example: case class MyClass(s: scala.collection.SortedMap[String, String] = SortedMap[String, String]()) object MyClass { def apply(i: Int): MyClass = MyClass() } import sparkSession.implicits._ List(MyClass(1), MyClass()).toDS().show(2) will return:
+-----+
|    s|
+-----+
|Map()|
|Map()|
+-----+
On the other hand, take() fails miserably at execution time: List(MyClass(1), MyClass()).toDS().take(2) ERROR codegen
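A sketch of one common workaround, assuming a plain `Map[String, String]` field is acceptable: Spark's built-in encoders handle ordinary Scala Maps, so converting the SortedMap before building the Dataset avoids the codegen failure (re-sorting on read is left to the caller). The `MyClassFlat` case class is hypothetical.

```scala
import scala.collection.SortedMap

import org.apache.spark.sql.SparkSession

// Field typed as a plain Map so Catalyst can derive an encoder for it.
case class MyClassFlat(s: Map[String, String] = Map())

object SortedMapWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sortedmap").getOrCreate()
    import spark.implicits._

    val sorted: SortedMap[String, String] = SortedMap("b" -> "2", "a" -> "1")

    // toMap drops the SortedMap type (and its ordering) but encodes cleanly.
    val ds = List(MyClassFlat(sorted.toMap), MyClassFlat()).toDS()
    ds.show(2)
    ds.take(2).foreach(println)

    spark.stop()
  }
}
```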