apache-spark-dataset

Spark Dataset : Example : Unable to generate an encoder issue

Submitted by 天涯浪子 on 2019-11-29 07:09:24
I am new to the Spark world and am trying a Dataset example written in Scala that I found online. When I run it through SBT, I keep getting the following error: org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class. Any idea what I am overlooking? Also feel free to point out a better way of writing the same Dataset example. Thanks.

sbt> runMain DatasetExample
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/25 01:06:39 INFO Remoting: Starting remoting
16/10/25 01:06:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp:/
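This error usually means the case class was declared inside another class or method, so the generated encoder would need an outer reference. Below is a minimal sketch of the usual fix, assuming a hypothetical Sale case class and the SparkSession API (the object name DatasetExample is taken from the log above; the asker's Spark version may differ):

```scala
import org.apache.spark.sql.SparkSession

// Top-level case class: Spark can derive a Product encoder for it.
case class Sale(id: Int, amount: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._  // brings the implicit encoders into scope

    val ds = Seq(Sale(1, 10.0), Sale(2, 20.5)).toDS()
    ds.show()
    spark.stop()
  }
}
```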

Why is the error “Unable to find encoder for type stored in a Dataset” when encoding JSON using case classes?

Submitted by 懵懂的女人 on 2019-11-29 01:35:00
I've written a Spark job:

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val ctx = new org.apache.spark.sql.SQLContext(sc)
    import ctx.implicits._
    case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
    case class Person2(name: String, age: Long, city: String)
    val persons = ctx.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
  }
}

When I run the main function in the IDE, two errors occur: Error:(15, 67) Unable to find encoder for type
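A sketch of the usual fix for this exact error: the case classes are declared inside main, which makes them local (inner) classes that Spark cannot derive an encoder for. Declaring them at the top level lets import ctx.implicits._ resolve the encoder (the code below keeps the question's names but omits the unused Person2):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Case class declared at the top level instead of inside main,
// so Spark's implicit Product encoders can be resolved for it.
case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val ctx = new org.apache.spark.sql.SQLContext(sc)
    import ctx.implicits._

    val persons = ctx.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
  }
}
```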

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

Submitted by 偶尔善良 on 2019-11-28 09:37:45
Question: RDD has a very useful method, aggregate, that lets you accumulate with a zero value and combine the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scaladoc, there is nothing capable of doing that; even the reduce method only supports binary operations with T as both arguments. Is there a reason why, and is there anything capable of doing the same? Thanks a lot! VK Answer 1: There are two different classes which can be
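For context, one Dataset-side counterpart to aggregate(zero)(seqOp, combOp) is org.apache.spark.sql.expressions.Aggregator, which also carries a zero value, a per-partition reduce and a cross-partition merge. A minimal sketch with made-up names (LengthSum, AggregatorDemo):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Sums the lengths of the strings in a Dataset[String].
object LengthSum extends Aggregator[String, Long, Long] {
  def zero: Long = 0L                                      // the "zero" value
  def reduce(acc: Long, s: String): Long = acc + s.length  // per-partition accumulation
  def merge(a: Long, b: Long): Long = a + b                // combine across partitions
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object AggregatorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg").getOrCreate()
    import spark.implicits._
    val ds = Seq("spark", "dataset", "aggregate").toDS()
    val total = ds.select(LengthSum.toColumn).first()      // total = 21
    println(total)
    spark.stop()
  }
}
```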

Spark dynamic DAG is a lot slower and different from hard coded DAG

Submitted by 纵饮孤独 on 2019-11-28 09:13:28
Question: I have an operation in Spark which should be performed for several columns in a data frame. Generally, there are two possibilities to specify such operations. Hardcode them:

handleBias("bar", df)
  .join(handleBias("baz", df), df.columns)
  .drop(columnsToDrop: _*).show

or dynamically generate them from a list of column names:

var isFirst = true
var res = df
for (col <- columnsToDrop ++ columnsToCode) {
  if (isFirst) {
    res = handleBias(col, res)
    isFirst = false
  } else {
    res = handleBias(col, res)
  }
}
res.drop
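As a side note, the mutable var/for loop above can be expressed as a foldLeft; it produces exactly the same chained lineage, so the sketch below (reusing the question's handleBias, columnsToDrop and columnsToCode names) only tidies the dynamic variant rather than changing the plan Spark sees:

```scala
import org.apache.spark.sql.DataFrame

// Apply handleBias once per column, threading the DataFrame through each step.
def applyAll(df: DataFrame,
             cols: Seq[String],
             handleBias: (String, DataFrame) => DataFrame): DataFrame =
  cols.foldLeft(df) { (acc, c) => handleBias(c, acc) }

// Usage, assuming the question's handleBias / columnsToDrop / columnsToCode:
// val res = applyAll(df, columnsToDrop ++ columnsToCode, handleBias)
// res.drop(columnsToDrop: _*).show()
```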

Why do columns change to nullable in Apache Spark SQL?

Submitted by 亡梦爱人 on 2019-11-28 01:11:54
Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?

val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
  .toDF("foo","bar")
  .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show

If df.printSchema is called now, nullable is false for both columns.

val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}
val fooMap = Map(1 -> "small", 2 -> "big")
val fooUDF = udf(foo)

myDf
  .withColumn("foo", fooUDF(col("foo")))
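The short version of what is going on: nullability is decided from the return type of the expression, not from the data, and a Scala UDF returning a reference type such as String is marked nullable. A small sketch (the NullabilityDemo and label names are made up; asNonNullable() requires Spark 2.3 or later):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NullabilityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullable").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2).toDF("foo")
    val label = udf((t: Int) => if (t == 2) "big" else "small")

    df.withColumn("label", label(col("foo"))).printSchema()
    // label: string (nullable = true) -- Spark cannot prove the UDF never returns null

    df.withColumn("label", label.asNonNullable()(col("foo"))).printSchema()
    // label: string (nullable = false) -- we assert it never returns null
    spark.stop()
  }
}
```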

How to get keys and values from MapType column in SparkSQL DataFrame

Submitted by 。_饼干妹妹 on 2019-11-27 21:30:27
I have data in a Parquet file which has two fields: object_id: String and alpha: Map<>. It is read into a data frame in Spark SQL and the schema looks like this:

scala> alphaDF.printSchema()
root
 |-- object_id: string (nullable = true)
 |-- ALPHA: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)

I am using Spark 2.0 and I am trying to create a new data frame in which the columns need to be object_id plus the keys of the ALPHA map, as in object_id, key1, key2, key3, ... I was first trying to see if I could at least access the map like this:

scala> alphaDF.map(a => a(0)
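One way to get there in Spark 2.x, sketched below with the question's column names (object_id, ALPHA) and a hypothetical helper mapKeysToColumns: explode the map to collect the distinct keys, then select one getItem column per key.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

def mapKeysToColumns(alphaDF: DataFrame): DataFrame = {
  // explode on a MapType column yields "key" and "value" columns
  val keys = alphaDF
    .select(explode(col("ALPHA")))
    .select(col("key"))
    .distinct()
    .collect()
    .map(_.getString(0))

  // one column per distinct key, named after that key
  val keyCols = keys.map(k => col("ALPHA").getItem(k).as(k))
  alphaDF.select((col("object_id") +: keyCols): _*)
}
```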

spark createOrReplaceTempView vs createGlobalTempView

Submitted by 戏子无情 on 2019-11-27 20:39:24
Spark Dataset 2.0 provides two functions, createOrReplaceTempView and createGlobalTempView, and I am not able to understand the basic difference between them. According to the API documents: createOrReplaceTempView: the lifetime of this temporary view is tied to the [[SparkSession]] that was used to create this Dataset. So when I call sparkSession.close(), the defined view will be destroyed; is that true? createGlobalTempView: the lifetime of this temporary view is tied to this Spark application. When will this type of view be destroyed? Any example, like sparkSession.close()? Gökhan Ayhan df
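A small sketch that makes the scoping visible (the view names people, people_global and the object TempViewDemo are made up): the session-scoped view disappears with its SparkSession, while the global view lives in the global_temp database for as long as the application runs.

```scala
import org.apache.spark.sql.SparkSession

object TempViewDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("views").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Session-scoped: visible only through this SparkSession.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    // Application-scoped: registered in the global_temp database and visible
    // from any session of the same application until the application stops.
    df.createGlobalTempView("people_global")
    spark.newSession().sql("SELECT * FROM global_temp.people_global").show()

    // A new session does not see the session-scoped view:
    // spark.newSession().sql("SELECT * FROM people")  // would fail to resolve
    spark.stop()
  }
}
```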

How to create a Dataset of Maps?

Submitted by 心已入冬 on 2019-11-27 14:51:51
I'm using Spark 2.2 and am running into trouble when attempting to call spark.createDataset on a Seq of Map. Code and output from my Spark shell session follow:

// createDataset on Seq[T] where T = Int works
scala> spark.createDataset(Seq(1, 2, 3)).collect
res0: Array[Int] = Array(1, 2, 3)

scala> spark.createDataset(Seq(Map(1 -> 2))).collect
<console>:24: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future
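A sketch of one workaround (not necessarily the accepted answer verbatim): on Spark 2.2 you can supply an explicit encoder yourself, for example a Kryo-based binary one, while Spark 2.3's spark.implicits._ adds a proper Map encoder. The names below (MapDatasetDemo, mapEncoder) are made up.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object MapDatasetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("maps").getOrCreate()

    // Kryo-based encoder: the maps are stored as a single binary column.
    implicit val mapEncoder: Encoder[Map[Int, Int]] = Encoders.kryo[Map[Int, Int]]

    val ds = spark.createDataset(Seq(Map(1 -> 2), Map(3 -> 4)))
    ds.collect().foreach(println)   // Map(1 -> 2), Map(3 -> 4)
    spark.stop()
  }
}
```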

What is the difference between Spark DataSet and RDD

Submitted by 谁都会走 on 2019-11-27 14:28:21
I'm still struggling to understand the full power of the recently introduced Spark Datasets. Are there best practices for when to use RDDs and when to use Datasets? In their announcement, Databricks explains that staggering reductions in both runtime and memory can be achieved by using Datasets. Still, it is claimed that Datasets are designed "to work alongside the existing RDD API". Is this just a reference to backward compatibility, or are there scenarios where one would prefer RDDs over Datasets? zero323 At this moment (Spark 1.6.0) DataSet API is just a preview and only a small subset
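Purely as an illustration of how the two APIs feel in practice (not taken from the answer), here is the same word count written against both; in the Dataset version the grouping is planned by Catalyst, whereas the RDD lambdas are opaque to the optimizer:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-ds").getOrCreate()
    import spark.implicits._
    val words = Seq("a", "b", "a", "c", "a")

    // RDD API: functional transformations the optimizer cannot look into.
    val rddCounts = spark.sparkContext.parallelize(words)
      .map(w => (w, 1L))
      .reduceByKey(_ + _)
      .collect()

    // Dataset API: the same logic, with the grouping/count planned by Catalyst.
    val dsCounts = words.toDS()
      .groupByKey(identity)
      .count()
      .collect()

    println(rddCounts.toSeq)
    println(dsCounts.toSeq)
    spark.stop()
  }
}
```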

Overwrite only some partitions in a partitioned spark Dataset

Submitted by 半腔热情 on 2019-11-27 12:28:52
How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, when recomputing last week's daily job, we only want to overwrite last week of data. The default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written. Madhava Carrillo Since Spark 2.3.0 this is an option when overwriting a table. To overwrite only some partitions, you need to set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example: spark.conf.set( "spark.sql.sources
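The answer's example is truncated above; the sketch below completes the usual pattern it describes for Spark 2.3+ (the path, column names and DynamicOverwriteDemo object are made up):

```scala
import org.apache.spark.sql.SparkSession

object DynamicOverwriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("overwrite").getOrCreate()
    import spark.implicits._

    // Only partitions present in the written data are replaced; the rest are kept.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    val lastWeek = Seq(("2019-11-25", 1), ("2019-11-26", 2)).toDF("day", "value")

    lastWeek.write
      .mode("overwrite")
      .partitionBy("day")
      .parquet("/tmp/daily_job_output")   // hypothetical output path

    spark.stop()
  }
}
```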