apache-spark-dataset

How to create a Dataset from custom class Person?

北战南征 submitted on 2019-12-04 23:43:39
Question: I was trying to create a Dataset in Java, so I wrote the following code:

    public Dataset<Person> createDataset() {
        List<Person> list = new ArrayList<>();
        list.add(new Person("name", 10, 10.0));
        Dataset<Person> dataset = sqlContext.createDataset(list, Encoders.bean(Person.class));
        return dataset;
    }

The Person class is an inner class. Spark, however, throws the following exception:

    org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `....` without access to the scope that this
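The exception itself points at the cause: a bean encoder cannot instantiate an inner class without its enclosing instance, so declaring Person as a static nested or top-level class is the usual way out. A minimal sketch of the same Dataset construction in Scala, assuming a top-level case class and a local SparkSession (all names illustrative):

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Top-level case class: Spark derives an encoder without needing an outer scope.
    case class Person(name: String, age: Int, salary: Double)

    object CreateDatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val ds: Dataset[Person] = Seq(Person("name", 10, 10.0)).toDS()
        ds.show()
      }
    }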

Generic iterator over dataframe (Spark/Scala)

不羁岁月 submitted on 2019-12-04 19:53:05
I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column. In the example below I'll use a simple expression where the current value of s is the product of all previous values, so it may seem this could be done with a UDF or even analytic functions. In reality, however, the logic is much more complex. The code below does what is needed:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    val q = """
      select 10 x, 1 y
      union all select 10, 2
      union all select 10, 3
      union all select 20, 6
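One way such order-dependent, stateful logic is often expressed on a Dataset is groupByKey plus flatMapGroups: sort each group into the required order, then carry state across it. A hedged sketch using the question's x/y columns and running product s (not the poster's actual solution, and it sorts each group in memory):

    import org.apache.spark.sql.SparkSession

    case class In(x: Int, y: Int)
    case class Out(x: Int, y: Int, s: Long)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(In(10, 1), In(10, 2), In(10, 3), In(20, 6)).toDS()

    val result = ds.groupByKey(_.x).flatMapGroups { (x, rows) =>
      // Sort the group, then fold state while iterating over it.
      var s = 1L
      rows.toSeq.sortBy(_.y).map { r =>
        s *= r.y
        Out(r.x, r.y, s)
      }
    }
    result.show()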

Spark DataSet filter performance

扶醉桌前 submitted on 2019-12-04 11:52:09
Question: I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different. The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading CSV data and mapping it to a case class:

    val df = spark.read.csv(csvFile).as[FireIncident]

A filter on UnitId = 'B02' should return 47,980 rows. I tested three ways as below:

1) Use a typed column (~500 ms on localhost):

    df.where($"UnitID" === "B02").count()

2)
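For context, the styles such comparisons typically cover differ in whether Catalyst can see the predicate and push it down. A hedged sketch (the question's list is cut off after option 1), with FireIncident reduced to two illustrative fields and the file path assumed:

    import org.apache.spark.sql.SparkSession

    // Stand-in for the question's 33-column case class.
    case class FireIncident(UnitID: String, IncidentNumber: String)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = spark.read.option("header", "true").csv("fire_incidents.csv").as[FireIncident]

    ds.where($"UnitID" === "B02").count()  // 1) column expression: predicate visible to Catalyst
    ds.filter(_.UnitID == "B02").count()   // 2) typed lambda: deserializes every row, opaque to the optimizer
    ds.createOrReplaceTempView("incidents")
    spark.sql("SELECT COUNT(*) FROM incidents WHERE UnitID = 'B02'").show()  // 3) SQL over a temp view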

Spark Dataset select with TypedColumn

若如初见. submitted on 2019-12-04 10:32:47
Question: Looking at the select() function on the Spark Dataset, there are various generated function signatures:

    (c1: TypedColumn[MyClass, U1], c2: TypedColumn[MyClass, U2] ....)

This seems to hint that I should be able to reference the members of MyClass directly and be type safe, but I'm not sure how... ds.select("member") of course works. It seems like ds.select(_.member) might also work somehow?

Answer 1: In the Scala DSL for select, there are many ways to identify a Column: From a symbol: 'name From a
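One standard way to get the type safety the question asks about is to turn a Column into a TypedColumn with .as[U], at which point select returns a typed Dataset rather than a DataFrame. A sketch with a stand-in case class (MyClass here is illustrative, not from the question):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class MyClass(member: String, n: Int)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[MyClass] = Seq(MyClass("a", 1), MyClass("b", 2)).toDS()

    // $"member".as[String] is a TypedColumn, so select yields Dataset[String].
    val members: Dataset[String] = ds.select($"member".as[String])
    members.show()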

Scala generic encoder for Spark case class

对着背影说爱祢 submitted on 2019-12-04 08:30:44
How can I get this method to compile? Strangely, Spark's implicits are already imported.

    def loadDsFromHive[T <: Product](tableName: String, spark: SparkSession): Dataset[T] = {
      import spark.implicits._
      spark.sql(s"SELECT * FROM $tableName").as[T]
    }

This is the error:

    Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
    [error] spark.sql(s"SELECT * FROM $tableName").as[T]

According to the source code for org.apache
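A common fix (sketched here; not necessarily where the truncated excerpt was heading) is to demand the encoder from the caller with a context bound. Inside the generic method, T is erased and spark.implicits._ has nothing concrete to derive an encoder from:

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    // Requiring Encoder[T] pushes encoder resolution to the call site,
    // where import spark.implicits._ can derive it for a concrete case class.
    def loadDsFromHive[T <: Product : Encoder](tableName: String, spark: SparkSession): Dataset[T] =
      spark.sql(s"SELECT * FROM $tableName").as[T]

    // Call site (MyTable is a hypothetical case class matching the table's schema):
    //   import spark.implicits._
    //   val ds = loadDsFromHive[MyTable]("my_db.my_table", spark)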

Apache Spark join with dynamic re-partitioning

五迷三道 submitted on 2019-12-04 06:10:05
Question: I'm trying to do a fairly straightforward join on two tables, nothing complicated. I load both tables, do a join, and update columns, but it keeps throwing an exception. I noticed the task is stuck on the last partition (199/200) and eventually crashes. My suspicion is that the data is skewed, causing most of it to land in the last partition, 199. SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million, while SELECT COUNT(*) FROM ReportDs returns 57 million. Cluster details: CPU: 40 cores
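When one join key dominates like this, a common mitigation is salting: append a random bucket to the skewed side and replicate the other side across all buckets, so each hot key is spread over many partitions. A hedged sketch, assuming reportDs is the skewed DataFrame, dimDs the other side, and report_audit the join key (the last two are guesses from the question):

    import org.apache.spark.sql.functions._

    val salts = 16 // number of buckets; tune to the observed skew

    // Random bucket on the skewed side; full replication on the other side.
    val saltedLeft  = reportDs.withColumn("salt", (rand() * salts).cast("int"))
    val saltedRight = dimDs.withColumn("salt", explode(array((0 until salts).map(lit): _*)))

    // Joining on (key, salt) splits each hot key across `salts` partitions.
    val joined = saltedLeft.join(saltedRight, Seq("report_audit", "salt"))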

How to find first non-null values in groups? (secondary sorting using dataset api)

蹲街弑〆低调 submitted on 2019-12-04 04:58:57
I am working on a dataset which represents a stream of events (like tracking events fired from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like the following gets us most of the way there:

    val eventsDf = spark.read.json(jsonEventsPath)

    case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )

    val projectedEventsDs = eventsDf.select(
      eventsDf("message.visit.id").alias("visitId"),
      eventsDf("message.property.user_id").alias("userId"),
      eventsDf("message.property.timestamp"),
      ...
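One way to express "first non-null per group, in timestamp order" without hand-rolled secondary sorting is first(..., ignoreNulls = true) over a window. A hedged sketch against the projected columns above (assuming projectedEventsDs from the question):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // The frame must span the whole partition; an ordered window otherwise
    // defaults to unboundedPreceding..currentRow.
    val w = Window
      .partitionBy("visitId")
      .orderBy("timestamp")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    // First non-null userId within each visit, in timestamp order.
    val withFirstUser = projectedEventsDs
      .withColumn("firstUserId", first($"userId", ignoreNulls = true).over(w))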

Spark Streaming: Reading data from Kafka that has multiple schemas

丶灬走出姿态 submitted on 2019-12-03 17:17:58
I am struggling with the implementation in Spark Streaming. The messages from Kafka look like this, but with more fields:

    {"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
    {"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
    {"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
    {"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}

I am trying to read the messages from a Kafka topic (which has multiple schemas). I need to read each message and look for an event and source field and
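One structured way to route such a multi-schema topic (a sketch assuming Structured Streaming and an illustrative broker/topic; the question may well be using DStreams): read values as strings, extract the routing fields with get_json_object, then parse each branch's payload with its own schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    val tagged = raw
      .withColumn("event",  get_json_object($"json", "$.event"))
      .withColumn("source", get_json_object($"json", "$.source"))

    // Branch per event type; each branch can apply from_json with its own payload schema.
    val sensorData = tagged.filter(lower($"event") === "sensordata")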

Spark 2 Dataset Null value exception

一笑奈何 submitted on 2019-12-03 12:31:09
I'm getting this null error with Spark Dataset.filter.

Input CSV:

    name,age,stat
    abc,22,m
    xyz,,s

Working code:

    case class Person(name: String, age: Long, stat: String)

    val peopleDS = spark.read.option("inferSchema", "true")
      .option("header", "true").option("delimiter", ",")
      .csv("./people.csv").as[Person]
    peopleDS.show()
    peopleDS.createOrReplaceTempView("people")
    spark.sql("select * from people where age > 30").show()

Failing code (adding the following lines returns the error):

    val filteredDS = peopleDS.filter(_.age > 30)
    filteredDS.show()

It returns a null error:

    java.lang.RuntimeException: Null value appeared in
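The typed filter fails because age is declared as the primitive Long while the second CSV row has an empty age, so a null reaches a non-nullable field during deserialization (the SQL version never deserializes into Person, which is why it works). The usual fix is to model the nullable column as Option; a sketch under that assumption, reusing the question's spark session and import spark.implicits._:

    case class Person(name: String, age: Option[Long], stat: String)

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("./people.csv")
      .as[Person]

    // Option.exists handles the missing age instead of dereferencing a null Long.
    val filteredDS = peopleDS.filter(_.age.exists(_ > 30))
    filteredDS.show()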