apache-spark-dataset

How to split multi-value column into separate rows using typed Dataset?

Submitted by 会有一股神秘感。 on 2019-11-30 16:58:18
Question: I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows. The initial dataset has the following type:

    Dataset[(Integer, String, Double, scala.List[String])]

    +---+--------------------+-------+--------------------+
    | id|                text|  value|          properties|
    +---+--------------------+-------+--------------------+
    |  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
    |  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
    |  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|
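
One way to do this with the typed API is to flatMap each record over its property list so that every property becomes its own row. The sketch below is my own illustration, not code from the question; the case classes, field names, and the SparkSession value spark are assumptions.

    case class Record(id: Int, text: String, value: Double, properties: List[String])
    case class Flat(id: Int, text: String, value: Double, property: String)

    import spark.implicits._   // encoders for the case classes

    val ds: Dataset[Record] = ???   // the original typed dataset

    // One output row per element of the properties list.
    val flattened: Dataset[Flat] =
      ds.flatMap(r => r.properties.map(p => Flat(r.id, r.text, r.value, p)))

On the untyped side, the same effect is usually achieved with the explode function on the array column.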

Spark Dataframes- Reducing By Key

Submitted by Deadly on 2019-11-30 15:57:25
Question: Let's say I have a data structure like this, where ts is some timestamp:

    case class Record(ts: Long, id: Int, value: Int)

Given a large number of these records I want to end up with the record with the highest timestamp for each id. Using the RDD API I think the following code gets the job done:

    def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
      records.keyBy(_.id).reduceByKey{
        (x, y) => if(x.ts > y.ts) x else y
      }.values
    }

Likewise this is my attempt with datasets:

    def …
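
One way the Dataset version can be written (my own sketch, not the truncated code from the question) is groupByKey followed by reduceGroups, which mirrors the per-key reduction of the RDD version:

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession): Dataset[Record] = {
      import spark.implicits._
      records
        .groupByKey(_.id)
        .reduceGroups((x, y) => if (x.ts > y.ts) x else y)
        .map(_._2)   // reduceGroups yields (key, record); keep only the record
    }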

Spark Dataframes- Reducing By Key

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-30 14:56:37
Let's say I have a data structure like this, where ts is some timestamp:

    case class Record(ts: Long, id: Int, value: Int)

Given a large number of these records I want to end up with the record with the highest timestamp for each id. Using the RDD API I think the following code gets the job done:

    def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
      records.keyBy(_.id).reduceByKey{
        (x, y) => if(x.ts > y.ts) x else y
      }.values
    }

Likewise this is my attempt with datasets:

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession) = {
      records.groupByKey(_.id).mapGroups{ …
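
The mapGroups attempt can be completed along these lines (again a sketch of my own, not the asker's missing code):

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession): Dataset[Record] = {
      import spark.implicits._
      records.groupByKey(_.id).mapGroups { (_, recs) =>
        recs.reduce((x, y) => if (x.ts > y.ts) x else y)
      }
    }

Note that mapGroups streams every record of a group through an iterator and cannot combine partial results on the map side, so for a simple per-key maximum the reduceGroups form shown above is usually the better choice.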

Spark Dataset API - join

Submitted by 烂漫一生 on 2019-11-30 14:37:45
Question: I am trying to use the Spark Dataset API but I am having some issues doing a simple join. Let's say I have two datasets with fields date | value; in the DataFrame case my join would look like:

    val dfA : DataFrame
    val dfB : DataFrame

    dfA.join(dfB, dfB("date") === dfA("date"))

However, for Dataset there is the .joinWith method, but the same approach does not work:

    val dfA : Dataset
    val dfB : Dataset

    dfA.joinWith(dfB, ? )

What is the argument required by .joinWith?

Answer 1: To use joinWith …
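
As a sketch of the shape of the call (my own illustration; the element type Reading and the column name date are assumptions): joinWith takes the other Dataset, a Column condition built from the two sides, and optionally a join type, and it returns a Dataset of pairs rather than a flattened row.

    case class Reading(date: String, value: Double)

    val dsA: Dataset[Reading] = ???
    val dsB: Dataset[Reading] = ???

    // Each result element is a (left, right) pair of the matched records.
    val joined: Dataset[(Reading, Reading)] =
      dsA.joinWith(dsB, dsA("date") === dsB("date"), "inner")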

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

Submitted by 做~自己de王妃 on 2019-11-30 08:37:21
Question: I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.

Scenario: We have some parquet files that we are reading in that have column1 but not column2. We load the data from these parquet files into a Dataset and cast it as MyType.

    case class MyType(column1: Option[String], column2: Option[Seq[String]])

    sqlContext.read.parquet("dataSource.parquet") …
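
A common workaround (my own sketch, not taken from the post) is to add the missing column as a typed null literal before applying the encoder, so the Option field decodes to None:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.{ArrayType, StringType}
    import sqlContext.implicits._

    val raw = sqlContext.read.parquet("dataSource.parquet")

    // If the files lack column2, add it as a null array<string> so the encoder sees it.
    val withAllColumns =
      if (raw.columns.contains("column2")) raw
      else raw.withColumn("column2", lit(null).cast(ArrayType(StringType)))

    val ds = withAllColumns.as[MyType]   // column2 is None for every row read this way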

Spark Dataset : Example : Unable to generate an encoder issue

Submitted by [亡魂溺海] on 2019-11-30 08:33:16
Question: I am new to the Spark world and trying a Dataset example written in Scala that I found online. On running it through SBT, I keep getting the following error:

    org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class

Any idea what I am overlooking? Also feel free to point out a better way of writing the same Dataset example. Thanks.

    sbt> runMain DatasetExample
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/10/25 01:06:39 INFO Remoting: …
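
A frequent cause of this particular error (general background, not something stated in the truncated log above) is a case class declared inside the class or method that drives the job, which leaves Spark unable to derive an encoder for it. Moving the case class to the top level and importing the session's implicits usually resolves it. The sketch below uses made-up names and assumes Spark 2.x; on 1.6 the same pattern applies with an SQLContext and its implicits.

    import org.apache.spark.sql.SparkSession

    // Declared at the top level, outside any class or method,
    // so Spark can generate an Encoder for it.
    case class Person(name: String, age: Int)

    object DatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DatasetExample")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._   // encoders for case classes and primitives

        val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
        people.show()
        spark.stop()
      }
    }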

How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Submitted by 房东的猫 on 2019-11-29 23:09:08
Question: Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here. The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets. To use the data in the lab I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. During the …
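
One way to approach this (a sketch under assumptions: it relies on the third-party spark-excel connector, com.crealytics:spark-excel, whose option names vary between versions, and the sheet names and file path are made up) is to read each sheet into its own DataFrame and union the results:

    import org.apache.spark.sql.DataFrame

    val sheetNames = Seq("Sheet1", "Sheet2", "Sheet3", "Sheet4", "Sheet5")

    def readSheet(name: String): DataFrame =
      spark.read
        .format("com.crealytics.spark.excel")
        .option("dataAddress", s"'$name'!A1")   // which sheet and start cell to read
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/path/to/powerplant.xlsx")

    // Concatenate the per-sheet DataFrames into a single one.
    val combined: DataFrame = sheetNames.map(readSheet).reduce(_ union _)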

Spark dynamic DAG is a lot slower and different from hard coded DAG

Submitted by 烈酒焚心 on 2019-11-29 15:20:43
I have an operation in Spark which should be performed for several columns in a data frame. Generally, there are two possibilities to specify such operations:

hardcoded:

    handleBias("bar", df)
      .join(handleBias("baz", df), df.columns)
      .drop(columnsToDrop: _*).show

dynamically generated from a list of column names:

    var isFirst = true
    var res = df
    for (col <- columnsToDrop ++ columnsToCode) {
      if (isFirst) {
        res = handleBias(col, res)
        isFirst = false
      } else {
        res = handleBias(col, res)
      }
    }
    res.drop(columnsToDrop: _*).show

The problem is that the DAG generated dynamically is different and the runtime of …
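
For reference, the dynamic variant is often written as a foldLeft over the column names; this is only a sketch that assumes the question's handleBias, columnsToDrop and columnsToCode, and it builds the same kind of chained plan as the loop rather than changing the DAG by itself:

    val res = (columnsToDrop ++ columnsToCode)
      .foldLeft(df)((current, col) => handleBias(col, current))

    res.drop(columnsToDrop: _*).show()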

Partition data for efficient joining for Spark dataframe/dataset

Submitted by 房东的猫 on 2019-11-29 10:35:39
Question: I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same executor, which makes joining more efficient (if one has shuffle-related operations before the join). Can the same thing be done on Spark DataFrames or Datasets?

Answer 1: You can repartition a DataFrame after loading it if you know you'll be joining it multiple times:

    val users = spark.read.load("/path/to/users") …
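
A sketch of the idea the answer is pointing at (my own completion, with a hypothetical userId join key and made-up paths): repartition the reused side on the join key and cache it, so later joins on that key can often reuse the existing partitioning instead of shuffling that side again.

    import org.apache.spark.sql.functions.col

    val users = spark.read.load("/path/to/users")
      .repartition(col("userId"))   // hash-partition by the join key
      .cache()                      // keep the partitioned data around for reuse

    val orders = spark.read.load("/path/to/orders")

    val joined = orders.join(users, Seq("userId"))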

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

Submitted by 守給你的承諾、 on 2019-11-29 07:13:17
I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.

Scenario: We have some parquet files that we are reading in that have column1 but not column2. We load the data from these parquet files into a Dataset and cast it as MyType.

    case class MyType(column1: Option[String], column2: Option[Seq[String]])

    sqlContext.read.parquet("dataSource.parquet").as[MyType]

    org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: …
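
Another possible approach (a sketch of mine, not from the post) is to impose the schema expected by the case class on the read itself, so files that never wrote column2 still surface it as a null array and the encoder turns it into None. This relies on the missing column being nullable, which Option fields are.

    import org.apache.spark.sql.Encoders
    import sqlContext.implicits._

    // Derive the expected schema from MyType and apply it when reading,
    // so column2 exists (as null) even for files that do not contain it.
    val expectedSchema = Encoders.product[MyType].schema

    val ds = sqlContext.read
      .schema(expectedSchema)
      .parquet("dataSource.parquet")
      .as[MyType]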