apache-spark-dataset

How to create a Dataset from custom class Person?

北战南征 submitted on 2019-12-04 23:43:39
Question: I was trying to create a Dataset in Java, so I wrote the following code:

    public Dataset<Person> createDataset() {
        List<Person> list = new ArrayList<>();
        list.add(new Person("name", 10, 10.0));
        Dataset<Person> dataset = sqlContext.createDataset(list, Encoders.bean(Person.class));
        return dataset;
    }

The Person class is an inner class. Spark, however, throws the following exception:

    org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `....` without access to the scope that this
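The exception itself points at the cause: a bean encoder cannot instantiate an inner class without its enclosing instance, so declaring Person as a static nested or top-level class is the usual way out. A minimal sketch of the same Dataset construction in Scala, assuming a top-level case class and a local SparkSession (all names illustrative):

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Top-level case class: Spark derives an encoder without needing an outer scope.
    case class Person(name: String, age: Int, salary: Double)

    object CreateDatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val ds: Dataset[Person] = Seq(Person("name", 10, 10.0)).toDS()
        ds.show()
      }
    }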

Generic iterator over dataframe (Spark/Scala)

不羁岁月 submitted on 2019-12-04 19:53:05
I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column. In the example below I'll use a simple expression where the current value of s is the product of all previous values, so it may seem this could be done with a UDF or even analytic functions. In reality, however, the logic is much more complex. The code below does what is needed:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    val q = """
      select 10 x, 1 y
      union all select 10, 2
      union all select 10, 3
      union all select 20, 6
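One way such order-dependent, stateful logic is often expressed on a Dataset is groupByKey plus flatMapGroups: sort each group into the required order, then carry state across it. A hedged sketch using the question's x/y columns and running product s (not the poster's actual solution, and it sorts each group in memory):

    import org.apache.spark.sql.SparkSession

    case class In(x: Int, y: Int)
    case class Out(x: Int, y: Int, s: Long)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(In(10, 1), In(10, 2), In(10, 3), In(20, 6)).toDS()

    val result = ds.groupByKey(_.x).flatMapGroups { (x, rows) =>
      // Sort the group, then fold state while iterating over it.
      var s = 1L
      rows.toSeq.sortBy(_.y).map { r =>
        s *= r.y
        Out(r.x, r.y, s)
      }
    }
    result.show()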

Spark DataSet filter performance

扶醉桌前 submitted on 2019-12-04 11:52:09
Question: I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different. The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading CSV data and mapping it to a case class:

    val df = spark.read.csv(csvFile).as[FireIncident]

A filter on UnitId = 'B02' should return 47,980 rows. I tested three ways as below:

1) Use a typed column (~500 ms on localhost):

    df.where($"UnitID" === "B02").count()

2)
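For context, the styles such comparisons typically cover differ in whether Catalyst can see the predicate and push it down. A hedged sketch (the question's list is cut off after option 1), with FireIncident reduced to two illustrative fields and the file path assumed:

    import org.apache.spark.sql.SparkSession

    // Stand-in for the question's 33-column case class.
    case class FireIncident(UnitID: String, IncidentNumber: String)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = spark.read.option("header", "true").csv("fire_incidents.csv").as[FireIncident]

    ds.where($"UnitID" === "B02").count()  // 1) column expression: predicate visible to Catalyst
    ds.filter(_.UnitID == "B02").count()   // 2) typed lambda: deserializes every row, opaque to the optimizer
    ds.createOrReplaceTempView("incidents")
    spark.sql("SELECT COUNT(*) FROM incidents WHERE UnitID = 'B02'").show()  // 3) SQL over a temp view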

Spark Dataset select with TypedColumn

若如初见. submitted on 2019-12-04 10:32:47
Question: Looking at the select() function on the Spark Dataset, there are various generated function signatures:

    (c1: TypedColumn[MyClass, U1], c2: TypedColumn[MyClass, U2] ....)

This seems to hint that I should be able to reference the members of MyClass directly and be type safe, but I'm not sure how... ds.select("member") of course works. It seems like ds.select(_.member) might also work somehow?

Answer 1: In the Scala DSL for select, there are many ways to identify a Column: From a symbol: 'name From a
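One standard way to get the type safety the question asks about is to turn a Column into a TypedColumn with .as[U], at which point select returns a typed Dataset rather than a DataFrame. A sketch with a stand-in case class (MyClass here is illustrative, not from the question):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class MyClass(member: String, n: Int)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[MyClass] = Seq(MyClass("a", 1), MyClass("b", 2)).toDS()

    // $"member".as[String] is a TypedColumn, so select yields Dataset[String].
    val members: Dataset[String] = ds.select($"member".as[String])
    members.show()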

Scala generic encoder for Spark case class

对着背影说爱祢 submitted on 2019-12-04 08:30:44
How can I get this method to compile? Strangely, Spark's implicits are already imported.

    def loadDsFromHive[T <: Product](tableName: String, spark: SparkSession): Dataset[T] = {
      import spark.implicits._
      spark.sql(s"SELECT * FROM $tableName").as[T]
    }

This is the error:

    Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
    [error] spark.sql(s"SELECT * FROM $tableName").as[T]

According to the source code for org.apache
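A common fix (sketched here; not necessarily where the truncated excerpt was heading) is to demand the encoder from the caller with a context bound. Inside the generic method, T is erased and spark.implicits._ has nothing concrete to derive an encoder from:

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    // Requiring Encoder[T] pushes encoder resolution to the call site,
    // where import spark.implicits._ can derive it for a concrete case class.
    def loadDsFromHive[T <: Product : Encoder](tableName: String, spark: SparkSession): Dataset[T] =
      spark.sql(s"SELECT * FROM $tableName").as[T]

    // Call site (MyTable is a hypothetical case class matching the table's schema):
    //   import spark.implicits._
    //   val ds = loadDsFromHive[MyTable]("my_db.my_table", spark)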

Apache Spark join with dynamic re-partitioning

五迷三道 submitted on 2019-12-04 06:10:05
Question: I'm trying to do a fairly straightforward join on two tables, nothing complicated. I load both tables, do a join, and update columns, but it keeps throwing an exception. I noticed the task is stuck on the last partition (199/200) and eventually crashes. My suspicion is that the data is skewed, causing most of it to land in the last partition, 199. SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million, while SELECT COUNT(*) FROM ReportDs returns 57 million. Cluster details: CPU: 40 cores
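When one join key dominates like this, a common mitigation is salting: append a random bucket to the skewed side and replicate the other side across all buckets, so each hot key is spread over many partitions. A hedged sketch, assuming reportDs is the skewed DataFrame, dimDs the other side, and report_audit the join key (the last two are guesses from the question):

    import org.apache.spark.sql.functions._

    val salts = 16 // number of buckets; tune to the observed skew

    // Random bucket on the skewed side; full replication on the other side.
    val saltedLeft  = reportDs.withColumn("salt", (rand() * salts).cast("int"))
    val saltedRight = dimDs.withColumn("salt", explode(array((0 until salts).map(lit): _*)))

    // Joining on (key, salt) splits each hot key across `salts` partitions.
    val joined = saltedLeft.join(saltedRight, Seq("report_audit", "salt"))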

How to find first non-null values in groups? (secondary sorting using dataset api)

蹲街弑〆低调 submitted on 2019-12-04 04:58:57
I am working on a dataset which represents a stream of events (like tracking events fired from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like the following gets us most of the way there:

    val eventsDf = spark.read.json(jsonEventsPath)

    case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )

    val projectedEventsDs = eventsDf.select(
      eventsDf("message.visit.id").alias("visitId"),
      eventsDf("message.property.user_id").alias("userId"),
      eventsDf("message.property.timestamp"),
      ...
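One way to express "first non-null per group, in timestamp order" without hand-rolled secondary sorting is first(..., ignoreNulls = true) over a window. A hedged sketch against the projected columns above (assuming projectedEventsDs from the question):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // The frame must span the whole partition; an ordered window otherwise
    // defaults to unboundedPreceding..currentRow.
    val w = Window
      .partitionBy("visitId")
      .orderBy("timestamp")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    // First non-null userId within each visit, in timestamp order.
    val withFirstUser = projectedEventsDs
      .withColumn("firstUserId", first($"userId", ignoreNulls = true).over(w))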

Spark Streaming: Reading data from Kafka that has multiple schemas

丶灬走出姿态 submitted on 2019-12-03 17:17:58
I am struggling with the implementation in Spark Streaming. The messages from Kafka look like this, but with more fields:

    {"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
    {"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
    {"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
    {"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}

I am trying to read the messages from a Kafka topic (which has multiple schemas). I need to read each message and look for an event and source field and
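One structured way to route such a multi-schema topic (a sketch assuming Structured Streaming and an illustrative broker/topic; the question may well be using DStreams): read values as strings, extract the routing fields with get_json_object, then parse each branch's payload with its own schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    val tagged = raw
      .withColumn("event",  get_json_object($"json", "$.event"))
      .withColumn("source", get_json_object($"json", "$.source"))

    // Branch per event type; each branch can apply from_json with its own payload schema.
    val sensorData = tagged.filter(lower($"event") === "sensordata")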

Spark 2 Dataset Null value exception

一笑奈何 submitted on 2019-12-03 12:31:09
I'm getting this null error with Spark Dataset.filter.

Input CSV:

    name,age,stat
    abc,22,m
    xyz,,s

Working code:

    case class Person(name: String, age: Long, stat: String)

    val peopleDS = spark.read.option("inferSchema", "true")
      .option("header", "true").option("delimiter", ",")
      .csv("./people.csv").as[Person]
    peopleDS.show()
    peopleDS.createOrReplaceTempView("people")
    spark.sql("select * from people where age > 30").show()

Failing code (adding the following lines returns the error):

    val filteredDS = peopleDS.filter(_.age > 30)
    filteredDS.show()

It returns a null error:

    java.lang.RuntimeException: Null value appeared in
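The typed filter fails because age is declared as the primitive Long while the second CSV row has an empty age, so a null reaches a non-nullable field during deserialization (the SQL version never deserializes into Person, which is why it works). The usual fix is to model the nullable column as Option; a sketch under that assumption, reusing the question's spark session and import spark.implicits._:

    case class Person(name: String, age: Option[Long], stat: String)

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("./people.csv")
      .as[Person]

    // Option.exists handles the missing age instead of dereferencing a null Long.
    val filteredDS = peopleDS.filter(_.age.exists(_ > 30))
    filteredDS.show()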