apache-spark-dataset

Convert scala list to DataFrame or DataSet

夙愿已清 submitted on 2020-01-02 03:00:15
Question: I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculated data on a source DataFrame) to a DataFrame or Dataset. I cannot find any direct method to do that. I have tried the following approaches to convert my list to a Dataset, but none of them seem to work. I am providing the three situations below. Can someone please give me some ray of hope on how to do this conversion? Thanks. import org.apache.spark.sql.{DataFrame, Row, SQLContext,
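A minimal sketch of the usual approach, assuming a local SparkSession and an illustrative `Result` case class: a list of case class instances converts with `toDS()`, and a list of tuples converts with `toDF()` once `spark.implicits._` is in scope.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class holding the computed results; it is defined at the top
// level so Spark can derive an implicit Encoder for it.
case class Result(key: String, score: Double)

object ListToDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list-to-ds").getOrCreate()
    import spark.implicits._

    // A list of case class instances converts directly to a typed Dataset ...
    val results = List(Result("a", 1.0), Result("b", 2.5))
    val ds = results.toDS()

    // ... while a list of tuples converts to a DataFrame with named columns.
    val df = List(("a", 1.0), ("b", 2.5)).toDF("key", "score")

    ds.show()
    df.show()
    spark.stop()
  }
}
```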

Spark Streaming: Reading data from Kafka that has multiple schemas

五迷三道 submitted on 2020-01-01 06:32:33
Question: I am struggling with the implementation in Spark Streaming. The messages from Kafka look like the following, but with more fields:
{"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
{"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
{"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
{"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}
I am trying to read the messages from a Kafka topic
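One way to handle a shared envelope like this is to extract the common fields, keep the payload as a raw JSON string, and route by event type so each branch can parse its payload with its own schema. A sketch assuming Structured Streaming with the spark-sql-kafka connector; the broker address and topic name are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

object MultiSchemaKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-schema-kafka").getOrCreate()
    import spark.implicits._

    // Placeholder broker and topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Pull out the envelope fields; the payload stays a JSON string for now.
    val parsed = raw
      .select($"value".cast("string").as("json"))
      .select(
        get_json_object($"json", "$.event").as("event"),
        get_json_object($"json", "$.source").as("source"),
        get_json_object($"json", "$.payload").as("payload"))

    // Route by event type; each branch can apply its own payload schema with from_json.
    val sensorData = parsed.filter($"event" === "sensordata")

    sensorData.writeStream.format("console").start().awaitTermination()
  }
}
```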

How to create a custom Encoder in Spark 2.X Datasets?

筅森魡賤 submitted on 2019-12-31 12:21:05
Question: Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a template for our own implementations. Here is an example of code that works in Spark 1.X / DataFrames but does not compile under the new regime: // mapping each row to an RDD tuple df.map(row => { var id: String = if (!has_id) "" else row.getAs[String]("id"
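For a class Catalyst cannot derive an encoder for, the usual route is not to subclass Encoder directly but to ask the Encoders factory for a Kryo-backed (or Java-serialization-backed) encoder and put it in implicit scope. A sketch with a hypothetical `MyRecord` class; the data is stored as a single binary column rather than typed columns.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Illustrative custom type that is not a case class / Product.
class MyRecord(val id: String, val score: Double) extends Serializable

object CustomEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("custom-encoder").getOrCreate()

    // Kryo-based encoder for MyRecord; Encoders.javaSerialization is the alternative.
    implicit val myRecordEncoder: Encoder[MyRecord] = Encoders.kryo[MyRecord]

    import spark.implicits._
    val ds = spark.createDataset(Seq(new MyRecord("a", 1.0), new MyRecord("b", 2.0)))

    // Map back to a primitive (String) to get a readable, typed column again.
    ds.map(_.id).show()

    spark.stop()
  }
}
```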

Mapping json to case class with Spark (spaces in the field name)

ⅰ亾dé卋堺 submitted on 2019-12-30 11:02:39
Question: I am trying to read a JSON file with the Spark Dataset API; the problem is that this JSON contains spaces in some of the field names. This would be a JSON row: {"Field Name" : "value"} My case class needs to look like this: case class MyType(`Field Name`: String) I can load the file into a DataFrame and it will infer the correct schema: val dataframe = spark.read.json(path) The problem comes when I try to convert the DataFrame to a Dataset[MyType] with dataframe.as[MyType]. The StructSchema loaded
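The backticked case class from the question compiles fine; the trouble starts at `dataframe.as[MyType]`. A sketch under an illustrative input path that reproduces the setup and shows one common workaround: rename the column to something without spaces and map to a case class with an ordinary field name (the `Renamed` class here is hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Backticks let a Scala case class field carry a space, matching the JSON key directly.
case class MyType(`Field Name`: String)

// Hypothetical companion case class for the rename-based workaround below.
case class Renamed(fieldName: String)

object SpacesInFieldNamesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("spaces-in-fields").getOrCreate()
    import spark.implicits._

    // Placeholder path; the JSON rows look like {"Field Name": "value"}.
    val dataframe = spark.read.json("/path/to/input.json")
    dataframe.printSchema() // root |-- Field Name: string (nullable = true)

    // Rename the offending column, then convert to a typed Dataset.
    val ds = dataframe.withColumnRenamed("Field Name", "fieldName").as[Renamed]
    ds.show()

    spark.stop()
  }
}
```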

Array Intersection in Spark SQL

旧巷老猫 submitted on 2019-12-29 08:03:10
Question: I have a table with an array-type column named writer, which has values like array[value1, value2], array[value2, value3], etc. I am doing a self join to get the rows whose arrays have values in common. I tried: sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ") and sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is
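Spark 2.4 and later ship a built-in `array_intersect` SQL function; on earlier versions the same check typically needs a UDF over `Seq[String]`. A sketch assuming Spark 2.4+, with an illustrative `writers` view and sample rows standing in for the question's table.

```scala
import org.apache.spark.sql.SparkSession

object ArrayIntersectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("array-intersection").getOrCreate()
    import spark.implicits._

    // Sample data shaped like the question: an id and an array column named writer.
    val writers = Seq(
      (1, Seq("value1", "value2")),
      (2, Seq("value2", "value3")),
      (3, Seq("value4"))
    ).toDF("id", "writer")
    writers.createOrReplaceTempView("writers")

    // Self-join keeping only pairs whose writer arrays share at least one element.
    val common = spark.sql(
      """SELECT R2.writer
        |FROM writers R1 JOIN writers R2 ON R1.id != R2.id
        |WHERE size(array_intersect(R1.writer, R2.writer)) > 0""".stripMargin)

    common.show()
    spark.stop()
  }
}
```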

Why do columns change to nullable in Apache Spark SQL?

依然范特西╮ submitted on 2019-12-28 06:47:13
Question: Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame? val myDf = Seq((2,"A"),(2,"B"),(1,"C")).toDF("foo","bar").withColumn("foo", 'foo.cast("Int")) myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show When df.printSchema is called now, nullable will be false for both columns. val foo: (Int => String) = (t: Int) => { fooMap.get(t) match { case Some(tt) => tt case None => "notFound" } } val
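A sketch reproducing the observation: an Int column built from tuples starts out with nullable = false, but a column produced by a Scala UDF is marked nullable = true because Spark cannot prove the function never returns null. The contents of `fooMap` here are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullability").getOrCreate()
    import spark.implicits._

    val myDf = Seq((2, "A"), (2, "B"), (1, "C")).toDF("foo", "bar")
    myDf.printSchema() // foo: integer (nullable = false), bar: string (nullable = true)

    // Illustrative lookup map and the UDF from the question.
    val fooMap = Map(1 -> "one", 2 -> "two")
    val foo: Int => String = t => fooMap.get(t) match {
      case Some(tt) => tt
      case None     => "notFound"
    }
    val fooUdf = udf(foo)

    // The derived column is nullable = true even though the UDF never returns null.
    myDf.withColumn("foo_name", fooUdf($"foo")).printSchema()
    spark.stop()
  }
}
```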

SparkSQL Aggregator: MissingRequirementError

浪子不回头ぞ submitted on 2019-12-25 07:14:08
Question: I am trying to use Apache Spark 2.0's Datasets: import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.Encoder import spark.implicits._ case class C1(f1: String, f2: String, f3: String, f4: String, f5: Double) val teams = Seq( C1("hash1", "NLC", "Cubs", "2016-01-23", 3253.21), C1("hash1", "NLC", "Cubs", "2014-01-23", 353.88), C1("hash3", "NLW", "Dodgers", "2013-08-15", 4322.12), C1("hash4", "NLE", "Red Sox", "2010-03-14", 10283.72) ).toDS() val c1Agg = new Aggregator
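A sketch of a complete typed Aggregator over the question's C1 rows, summing f5 per f1 key; the aggregation itself is illustrative. Defining the case class and the Aggregator at the top level, rather than inside a method or notebook cell, is a common way to sidestep the reflection problems behind MissingRequirementError.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class C1(f1: String, f2: String, f3: String, f4: String, f5: Double)

// Typed aggregator: input C1, buffer Double, output Double.
object SumF5 extends Aggregator[C1, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, row: C1): Double = acc + row.f5
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("aggregator").getOrCreate()
    import spark.implicits._

    val teams = Seq(
      C1("hash1", "NLC", "Cubs", "2016-01-23", 3253.21),
      C1("hash1", "NLC", "Cubs", "2014-01-23", 353.88),
      C1("hash3", "NLW", "Dodgers", "2013-08-15", 4322.12),
      C1("hash4", "NLE", "Red Sox", "2010-03-14", 10283.72)
    ).toDS()

    // Group by the f1 key and apply the aggregator as a typed column.
    teams.groupByKey(_.f1).agg(SumF5.toColumn.name("total_f5")).show()
    spark.stop()
  }
}
```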

Spark Encoders: when to use beans()

巧了我就是萌 submitted on 2019-12-24 11:16:13
Question: I came across a memory management problem while using Spark's caching mechanism. I am currently using Encoders with Kryo and was wondering whether switching to beans would help me reduce the size of my cached dataset. Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with the SER option? For the record, I have found a similar topic that
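A sketch contrasting the two explicit encoders the question mentions: Encoders.bean keeps a columnar layout Spark can inspect (fields become real columns), while Encoders.kryo stores each object as one opaque binary column. The `SensorReading` bean and the storage-level call are illustrative, not taken from the question.

```scala
import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.storage.StorageLevel

// Illustrative JavaBean: bean encoders need getters/setters and a no-arg constructor.
class SensorReading(@BeanProperty var id: String, @BeanProperty var reading: Double)
  extends Serializable {
  def this() = this(null, 0.0)
}

object BeanVsKryoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bean-vs-kryo").getOrCreate()

    val beanEncoder: Encoder[SensorReading] = Encoders.bean(classOf[SensorReading])
    val kryoEncoder: Encoder[SensorReading] = Encoders.kryo(classOf[SensorReading])

    val data = Seq(new SensorReading("a", 1.0), new SensorReading("b", 2.0))

    // Bean encoding exposes real columns (id, reading), which Spark can cache compactly.
    val beanDs = spark.createDataset(data)(beanEncoder)
    beanDs.persist(StorageLevel.MEMORY_ONLY_SER).count()
    beanDs.printSchema()

    // Kryo encoding produces a single binary column that Spark cannot inspect or prune.
    val kryoDs = spark.createDataset(data)(kryoEncoder)
    kryoDs.printSchema()

    spark.stop()
  }
}
```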

SortedMap non serializable error in Spark Dataset

我与影子孤独终老i submitted on 2019-12-24 10:47:34
Question: It seems like scala.collection.SortedMap is not serializable? A simple code example: case class MyClass(s: scala.collection.SortedMap[String, String] = SortedMap[String, String]()) object MyClass { def apply(i: Int): MyClass = MyClass() } import sparkSession.implicits._ List(MyClass(1), MyClass()).toDS().show(2) will return:
+-----+
|    s|
+-----+
|Map()|
|Map()|
+-----+
On the other hand, take() fails miserably at execution time: List(MyClass(1), MyClass()).toDS().take(2) ERROR codegen
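A sketch of one common workaround, assuming a plain `Map[String, String]` field is acceptable: Spark's built-in encoders handle ordinary Scala Maps, so converting the SortedMap before building the Dataset avoids the codegen failure (re-sorting on read is left to the caller). The `MyClassFlat` case class is hypothetical.

```scala
import scala.collection.SortedMap

import org.apache.spark.sql.SparkSession

// Field typed as a plain Map so Catalyst can derive an encoder for it.
case class MyClassFlat(s: Map[String, String] = Map())

object SortedMapWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sortedmap").getOrCreate()
    import spark.implicits._

    val sorted: SortedMap[String, String] = SortedMap("b" -> "2", "a" -> "1")

    // toMap drops the SortedMap type (and its ordering) but encodes cleanly.
    val ds = List(MyClassFlat(sorted.toMap), MyClassFlat()).toDS()
    ds.show(2)
    ds.take(2).foreach(println)

    spark.stop()
  }
}
```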