apache-spark-dataset

How to split multi-value column into separate rows using typed Dataset?

Submitted by 会有一股神秘感。 on 2019-11-30 16:58:18
Question: I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows. The initial dataset has the following type:

    Dataset[(Integer, String, Double, scala.List[String])]

    +---+--------------------+-------+--------------------+
    | id|                text|  value|          properties|
    +---+--------------------+-------+--------------------+
    |  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
    |  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
    |  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|
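
One way to do this with the typed API is to flatMap each record over its property list so that every property becomes its own row. The sketch below is my own illustration, not code from the question; the case classes, field names, and the SparkSession value spark are assumptions.

    case class Record(id: Int, text: String, value: Double, properties: List[String])
    case class Flat(id: Int, text: String, value: Double, property: String)

    import spark.implicits._   // encoders for the case classes

    val ds: Dataset[Record] = ???   // the original typed dataset

    // One output row per element of the properties list.
    val flattened: Dataset[Flat] =
      ds.flatMap(r => r.properties.map(p => Flat(r.id, r.text, r.value, p)))

On the untyped side, the same effect is usually achieved with the explode function on the array column.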

Spark Dataframes- Reducing By Key

Submitted by Deadly on 2019-11-30 15:57:25
Question: Let's say I have a data structure like this, where ts is some timestamp:

    case class Record(ts: Long, id: Int, value: Int)

Given a large number of these records I want to end up with the record with the highest timestamp for each id. Using the RDD API I think the following code gets the job done:

    def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
      records.keyBy(_.id).reduceByKey{
        (x, y) => if(x.ts > y.ts) x else y
      }.values
    }

Likewise this is my attempt with datasets:

    def …
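
One way the Dataset version can be written (my own sketch, not the truncated code from the question) is groupByKey followed by reduceGroups, which mirrors the per-key reduction of the RDD version:

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession): Dataset[Record] = {
      import spark.implicits._
      records
        .groupByKey(_.id)
        .reduceGroups((x, y) => if (x.ts > y.ts) x else y)
        .map(_._2)   // reduceGroups yields (key, record); keep only the record
    }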

Spark Dataframes- Reducing By Key

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-30 14:56:37
Let's say I have a data structure like this, where ts is some timestamp:

    case class Record(ts: Long, id: Int, value: Int)

Given a large number of these records I want to end up with the record with the highest timestamp for each id. Using the RDD API I think the following code gets the job done:

    def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
      records.keyBy(_.id).reduceByKey{
        (x, y) => if(x.ts > y.ts) x else y
      }.values
    }

Likewise this is my attempt with datasets:

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession) = {
      records.groupByKey(_.id).mapGroups{ …
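
The mapGroups attempt can be completed along these lines (again a sketch of my own, not the asker's missing code):

    def findLatest(records: Dataset[Record])(implicit spark: SparkSession): Dataset[Record] = {
      import spark.implicits._
      records.groupByKey(_.id).mapGroups { (_, recs) =>
        recs.reduce((x, y) => if (x.ts > y.ts) x else y)
      }
    }

Note that mapGroups streams every record of a group through an iterator and cannot combine partial results on the map side, so for a simple per-key maximum the reduceGroups form shown above is usually the better choice.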

Spark Dataset API - join

Submitted by 烂漫一生 on 2019-11-30 14:37:45
Question: I am trying to use the Spark Dataset API but I am having some issues doing a simple join. Let's say I have two datasets with fields date | value; in the DataFrame case my join would look like:

    val dfA : DataFrame
    val dfB : DataFrame

    dfA.join(dfB, dfB("date") === dfA("date"))

However, for Dataset there is the .joinWith method, but the same approach does not work:

    val dfA : Dataset
    val dfB : Dataset

    dfA.joinWith(dfB, ? )

What is the argument required by .joinWith?

Answer 1: To use joinWith …
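
As a sketch of the shape of the call (my own illustration; the element type Reading and the column name date are assumptions): joinWith takes the other Dataset, a Column condition built from the two sides, and optionally a join type, and it returns a Dataset of pairs rather than a flattened row.

    case class Reading(date: String, value: Double)

    val dsA: Dataset[Reading] = ???
    val dsB: Dataset[Reading] = ???

    // Each result element is a (left, right) pair of the matched records.
    val joined: Dataset[(Reading, Reading)] =
      dsA.joinWith(dsB, dsA("date") === dsB("date"), "inner")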

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

Submitted by 做~自己de王妃 on 2019-11-30 08:37:21
Question: I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.

Scenario: We have some parquet files that we are reading in that have column1 but not column2. We load the data from these parquet files into a Dataset and cast it as MyType.

    case class MyType(column1: Option[String], column2: Option[Seq[String]])

    sqlContext.read.parquet("dataSource.parquet") …
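
A common workaround (my own sketch, not taken from the post) is to add the missing column as a typed null literal before applying the encoder, so the Option field decodes to None:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.{ArrayType, StringType}
    import sqlContext.implicits._

    val raw = sqlContext.read.parquet("dataSource.parquet")

    // If the files lack column2, add it as a null array<string> so the encoder sees it.
    val withAllColumns =
      if (raw.columns.contains("column2")) raw
      else raw.withColumn("column2", lit(null).cast(ArrayType(StringType)))

    val ds = withAllColumns.as[MyType]   // column2 is None for every row read this way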

Spark Dataset : Example : Unable to generate an encoder issue

Submitted by [亡魂溺海] on 2019-11-30 08:33:16
Question: I am new to the Spark world and trying a Dataset example written in Scala that I found online. On running it through SBT, I keep getting the following error:

    org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class

Any idea what I am overlooking? Also feel free to point out a better way of writing the same Dataset example. Thanks.

    sbt> runMain DatasetExample
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/10/25 01:06:39 INFO Remoting: …
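
A frequent cause of this particular error (general background, not something stated in the truncated log above) is a case class declared inside the class or method that drives the job, which leaves Spark unable to derive an encoder for it. Moving the case class to the top level and importing the session's implicits usually resolves it. The sketch below uses made-up names and assumes Spark 2.x; on 1.6 the same pattern applies with an SQLContext and its implicits.

    import org.apache.spark.sql.SparkSession

    // Declared at the top level, outside any class or method,
    // so Spark can generate an Encoder for it.
    case class Person(name: String, age: Int)

    object DatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DatasetExample")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._   // encoders for case classes and primitives

        val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
        people.show()
        spark.stop()
      }
    }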

How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Submitted by 房东的猫 on 2019-11-29 23:09:08
Question: Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here. The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets. To use the data in the lab I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. During the …
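
One way to approach this (a sketch under assumptions: it relies on the third-party spark-excel connector, com.crealytics:spark-excel, whose option names vary between versions, and the sheet names and file path are made up) is to read each sheet into its own DataFrame and union the results:

    import org.apache.spark.sql.DataFrame

    val sheetNames = Seq("Sheet1", "Sheet2", "Sheet3", "Sheet4", "Sheet5")

    def readSheet(name: String): DataFrame =
      spark.read
        .format("com.crealytics.spark.excel")
        .option("dataAddress", s"'$name'!A1")   // which sheet and start cell to read
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/path/to/powerplant.xlsx")

    // Concatenate the per-sheet DataFrames into a single one.
    val combined: DataFrame = sheetNames.map(readSheet).reduce(_ union _)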

Spark dynamic DAG is a lot slower and different from hard coded DAG

Submitted by 烈酒焚心 on 2019-11-29 15:20:43
I have an operation in Spark which should be performed for several columns in a data frame. Generally, there are two possibilities to specify such operations:

hardcoded:

    handleBias("bar", df)
      .join(handleBias("baz", df), df.columns)
      .drop(columnsToDrop: _*).show

dynamically generated from a list of column names:

    var isFirst = true
    var res = df
    for (col <- columnsToDrop ++ columnsToCode) {
      if (isFirst) {
        res = handleBias(col, res)
        isFirst = false
      } else {
        res = handleBias(col, res)
      }
    }
    res.drop(columnsToDrop: _*).show

The problem is that the DAG generated dynamically is different and the runtime of …
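
For reference, the dynamic variant is often written as a foldLeft over the column names; this is only a sketch that assumes the question's handleBias, columnsToDrop and columnsToCode, and it builds the same kind of chained plan as the loop rather than changing the DAG by itself:

    val res = (columnsToDrop ++ columnsToCode)
      .foldLeft(df)((current, col) => handleBias(col, current))

    res.drop(columnsToDrop: _*).show()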

Partition data for efficient joining for Spark dataframe/dataset

Submitted by 房东的猫 on 2019-11-29 10:35:39
Question: I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same executor, which makes joining more efficient (if one has shuffle-related operations before the join). Can the same thing be done on Spark DataFrames or Datasets?

Answer 1: You can repartition a DataFrame after loading it if you know you'll be joining it multiple times:

    val users = spark.read.load("/path/to/users") …
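
A sketch of the idea the answer is pointing at (my own completion, with a hypothetical userId join key and made-up paths): repartition the reused side on the join key and cache it, so later joins on that key can often reuse the existing partitioning instead of shuffling that side again.

    import org.apache.spark.sql.functions.col

    val users = spark.read.load("/path/to/users")
      .repartition(col("userId"))   // hash-partition by the join key
      .cache()                      // keep the partitioned data around for reuse

    val orders = spark.read.load("/path/to/orders")

    val joined = orders.join(users, Seq("userId"))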

Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

Submitted by 守給你的承諾、 on 2019-11-29 07:13:17
I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.

Scenario: We have some parquet files that we are reading in that have column1 but not column2. We load the data from these parquet files into a Dataset and cast it as MyType.

    case class MyType(column1: Option[String], column2: Option[Seq[String]])

    sqlContext.read.parquet("dataSource.parquet").as[MyType]

    org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: …
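
Another possible approach (a sketch of mine, not from the post) is to impose the schema expected by the case class on the read itself, so files that never wrote column2 still surface it as a null array and the encoder turns it into None. This relies on the missing column being nullable, which Option fields are.

    import org.apache.spark.sql.Encoders
    import sqlContext.implicits._

    // Derive the expected schema from MyType and apply it when reading,
    // so column2 exists (as null) even for files that do not contain it.
    val expectedSchema = Encoders.product[MyType].schema

    val ds = sqlContext.read
      .schema(expectedSchema)
      .parquet("dataSource.parquet")
      .as[MyType]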