apache-spark-dataset

Encoder for Row Type Spark Datasets

Submitted by 戏子无情 on 2019-12-20 08:36:53
Question: I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation; instead of returning Dataset<String>, I would like to return Dataset<Row>:

    Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() {
        @Override
        public Iterator<String> call(Row row) throws Exception {
            ArrayList<String> obj = // some map operation
            return obj
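
A minimal sketch of the usual approach, assuming Spark 2.x and a Scala Dataset[Row] named dataset1 (the output schema here is hypothetical): build an Encoder[Row] from an explicit StructType with RowEncoder and pass it to flatMap. In Java the same encoder object would be passed as the second argument to flatMap.

    import org.apache.spark.sql.{Dataset, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical schema of the rows the flatMap will produce.
    val outSchema = StructType(Seq(StructField("value", StringType, nullable = true)))

    // RowEncoder turns a StructType into an Encoder[Row], so the result
    // of flatMap can be a Dataset[Row] instead of Dataset[String].
    val output: Dataset[Row] = dataset1.flatMap(
      (row: Row) => Seq(Row(row.mkString(" ")))
    )(RowEncoder(outSchema))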

When to use Spark DataFrame/Dataset API and when to use plain RDD?

Submitted by 百般思念 on 2019-12-19 17:13:27
Question: The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDD for most distributed algorithms. However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But executing the algorithms themselves may not be any faster, aside from the savings from predefined expressions.
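
For concreteness, a small sketch (the names and data are made up) of the same aggregation in both APIs: the DataFrame/Dataset version is planned by Catalyst and executed over InternalRow with generated code, while the RDD version runs the closures as-is.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    case class Sale(city: String, amount: Double)

    val spark = SparkSession.builder.appName("df-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("NY", 10.0), Sale("NY", 5.0), Sale("SF", 7.0)).toDS()

    // DataFrame/Dataset API: the plan goes through Catalyst and Tungsten.
    val byCityDF = sales.groupBy($"city").agg(sum($"amount"))

    // Plain RDD API: the same logic as opaque JVM closures, no expression optimization.
    val byCityRDD = sales.rdd.map(s => (s.city, s.amount)).reduceByKey(_ + _)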

How to use approxQuantile by group?

Submitted by 余生颓废 on 2019-12-19 04:15:34
Question: Spark has the SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile(). However, the Scala counterpart cannot be used on grouped datasets, i.e. something like df.groupBy("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027. But it is possible to do both grouping and percentiles in SQL syntax. So I'm wondering: maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset? Answer 1: While you cannot use
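
One approach that works on grouped data, sketched here with a hypothetical value column named "bar": since percentile_approx is exposed to SQL, wrap it in expr() and use it as an ordinary aggregate on the grouped DataFrame.

    import org.apache.spark.sql.functions.expr

    // percentile_approx is a SQL-only function in Spark 2.x, so call it via expr().
    val medians = df.groupBy("foo")
      .agg(expr("percentile_approx(bar, 0.5)").as("approx_median"))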

Why is the error “Unable to find encoder for type stored in a Dataset” when encoding JSON using case classes?

Submitted by 丶灬走出姿态 on 2019-12-18 03:14:06
Question: I've written a Spark job:

    object SimpleApp {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
        val sc = new SparkContext(conf)
        val ctx = new org.apache.spark.sql.SQLContext(sc)
        import ctx.implicits._

        case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
        case class Person2(name: String, age: Long, city: String)

        val persons = ctx.read.json("/tmp/persons.json").as[Person]
        persons.printSchema()
      }
    }
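
A common cause of this error with code shaped like the above is defining the case classes inside main, where the implicits cannot derive encoders for them. A minimal sketch of the fix, with the case classes moved to the top level (SparkSession is used here instead of the older SQLContext purely for brevity):

    import org.apache.spark.sql.SparkSession

    // Top-level case classes: spark.implicits._ can now derive Encoders for them.
    case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
    case class Person2(name: String, age: Long, city: String)

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
        import spark.implicits._

        val persons = spark.read.json("/tmp/persons.json").as[Person]
        persons.printSchema()
      }
    }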

spark createOrReplaceTempView vs createGlobalTempView

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-17 15:43:24
Question: Spark Dataset 2.0 provides two functions, createOrReplaceTempView and createGlobalTempView. I am not able to understand the basic difference between the two functions. According to the API documents: createOrReplaceTempView: The lifetime of this temporary view is tied to the [[SparkSession]] that was used to create this Dataset. So, when I call sparkSession.close(), the defined view will be destroyed. Is that true? createGlobalTempView: The lifetime of this temporary view is tied to this Spark application
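
A short sketch of the difference, assuming an existing SparkSession named spark: the session-scoped view disappears with the SparkSession that created it, while the global view lives in the reserved global_temp database and is visible from any session of the same application.

    val df = spark.range(5).toDF("id")

    // Session-scoped: visible only through the SparkSession that created it.
    df.createOrReplaceTempView("ids")
    spark.sql("SELECT * FROM ids").show()

    // Application-scoped: registered in the global_temp database and visible
    // from other sessions of the same Spark application.
    df.createGlobalTempView("ids_global")
    spark.newSession().sql("SELECT * FROM global_temp.ids_global").show()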

Encoder error while trying to map dataframe row to updated row

Submitted by [亡魂溺海] on 2019-12-16 22:22:41
Question: When I am trying to do the same thing in my code as mentioned below

    dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })

(I have taken the above reference from here: Scala: How can I replace value in Dataframes using scala), I am getting an encoder error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark
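
Row itself has no implicit Encoder, so one common fix (sketched for Spark 2.x, and assuming the output rows keep the input schema) is to pass a RowEncoder explicitly to map:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // The output rows have the same shape as the input, so reuse the existing schema.
    val updated = dataframe.map { row =>
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    }(RowEncoder(dataframe.schema))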

Spark 2.0 Dataset vs DataFrame

Submitted by 核能气质少年 on 2019-12-16 20:15:32
Question: Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly that myDataSet.map(foo.someVal) is typesafe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)? All the other commands, e.g. select, are just syntactic sugar; they are not typesafe and a map could be used
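
On the first point, a small self-contained sketch (the case class and values are made up): select("foo") and select($"foo") resolve to the same Column, but the $ form composes into expressions, while the typed map is compile-time checked yet opaque to Catalyst.

    import org.apache.spark.sql.SparkSession

    case class Rec(foo: String, someVal: Int)

    val spark = SparkSession.builder.appName("select-vs-map").master("local[*]").getOrCreate()
    import spark.implicits._   // provides the $"..." column syntax

    val ds = Seq(Rec("a", 1), Rec("b", 2)).toDS()

    // Equivalent untyped projections; both return a DataFrame.
    ds.select("foo")
    ds.select($"foo")

    // $"..." is a Column, so it composes into expressions a plain string cannot.
    ds.select($"someVal" + 1)

    // Typed access: checked at compile time, but the lambda is opaque to the optimizer.
    val vals = ds.map(r => r.someVal)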

Getting the Summary of Whole Dataset or Only Columns in Apache Spark Java

Submitted by 白昼怎懂夜的黑 on 2019-12-13 22:29:56
Question: For the Dataset below, to get the total summary values of Col1, I did

    import org.apache.spark.sql.functions._
    val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"),
      sum("price").as("price"), sum("displayPrice").as("displayPrice"))

and then merged it with df:

    df.union(totaldf).orderBy(col("Col1"), col("Col2").desc).show(false)

df:

    +-----------+-------+--------+--------------+
    | Col1      | Col2  | price  | displayPrice |
    +-----------+-------+--------+--------------+
    | Category1 | item1 | 15     | 14           |
    |
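
An alternative sketch that avoids the union, assuming each Col2 value appears only once per Col1: rollup produces the detail rows plus per-Col1 subtotal rows (where Col2 is null), which can then be relabelled as "Total".

    import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

    val withTotals = df.rollup("Col1", "Col2")
      .agg(sum("price").as("price"), sum("displayPrice").as("displayPrice"))
      .where(col("Col1").isNotNull)                              // drop the grand-total row
      .withColumn("Col2", coalesce(col("Col2"), lit("Total")))   // label the subtotal rows
      .orderBy(col("Col1"), col("Col2").desc)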

saveAsTextFile hangs in spark java.io.IOException: Connection reset by peer in Data frame

Submitted by 丶灬走出姿态 on 2019-12-13 19:04:01
Question: I am running an application in Spark which does a simple diff between two data frames. I execute it as a jar file in my cluster environment, which is a 94-node cluster. There are two data sets, 2 GB and 4 GB, which are mapped to data frames. My job works fine for very small files... I personally think saveAsTextFile takes the most time in my application. Below are my cluster config details: Total Vmem allocated for Containers 394.80 GB
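
For reference, a minimal sketch of the diff-then-save pattern described above (the output path and partition count are made up): except() gives the rows in df1 that are not in df2, and repartitioning before the write spreads the output across tasks instead of funnelling it through a few huge partitions.

    // Rows present in df1 but not in df2.
    val diff = df1.except(df2)

    diff.repartition(200)
      .rdd
      .map(_.mkString(","))
      .saveAsTextFile("hdfs:///tmp/df-diff")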