apache-spark-dataset

Encoder for Row Type Spark Datasets

Submitted by 戏子无情 on 2019-12-20 08:36:53
Question: I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation; instead of returning Dataset<String>, I would like to return Dataset<Row>:

    Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() {
        @Override
        public Iterator<String> call(Row row) throws Exception {
            ArrayList<String> obj = // some map operation
            return obj
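
A minimal sketch of the usual approach, assuming Spark 2.x and a Scala Dataset[Row] named dataset1 (the output schema here is hypothetical): build an Encoder[Row] from an explicit StructType with RowEncoder and pass it to flatMap. In Java the same encoder object would be passed as the second argument to flatMap.

    import org.apache.spark.sql.{Dataset, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical schema of the rows the flatMap will produce.
    val outSchema = StructType(Seq(StructField("value", StringType, nullable = true)))

    // RowEncoder turns a StructType into an Encoder[Row], so the result
    // of flatMap can be a Dataset[Row] instead of Dataset[String].
    val output: Dataset[Row] = dataset1.flatMap(
      (row: Row) => Seq(Row(row.mkString(" ")))
    )(RowEncoder(outSchema))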

When to use Spark DataFrame/Dataset API and when to use plain RDD?

Submitted by 百般思念 on 2019-12-19 17:13:27
Question: The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDD for most distributed algorithms. However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But executing the algorithms themselves may not be any faster, aside from the savings from predefined expressions.
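
For concreteness, a small sketch (the names and data are made up) of the same aggregation in both APIs: the DataFrame/Dataset version is planned by Catalyst and executed over InternalRow with generated code, while the RDD version runs the closures as-is.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    case class Sale(city: String, amount: Double)

    val spark = SparkSession.builder.appName("df-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("NY", 10.0), Sale("NY", 5.0), Sale("SF", 7.0)).toDS()

    // DataFrame/Dataset API: the plan goes through Catalyst and Tungsten.
    val byCityDF = sales.groupBy($"city").agg(sum($"amount"))

    // Plain RDD API: the same logic as opaque JVM closures, no expression optimization.
    val byCityRDD = sales.rdd.map(s => (s.city, s.amount)).reduceByKey(_ + _)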

How to use approxQuantile by group?

Submitted by 余生颓废 on 2019-12-19 04:15:34
Question: Spark has the SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile(). However, the Scala counterpart cannot be used on grouped datasets, i.e. something like df.groupBy("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027. But it is possible to do both grouping and percentiles in SQL syntax. So I'm wondering: maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset? Answer 1: While you cannot use
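
One approach that works on grouped data, sketched here with a hypothetical value column named "bar": since percentile_approx is exposed to SQL, wrap it in expr() and use it as an ordinary aggregate on the grouped DataFrame.

    import org.apache.spark.sql.functions.expr

    // percentile_approx is a SQL-only function in Spark 2.x, so call it via expr().
    val medians = df.groupBy("foo")
      .agg(expr("percentile_approx(bar, 0.5)").as("approx_median"))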

Why is the error “Unable to find encoder for type stored in a Dataset” when encoding JSON using case classes?

Submitted by 丶灬走出姿态 on 2019-12-18 03:14:06
Question: I've written a Spark job:

    object SimpleApp {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
        val sc = new SparkContext(conf)
        val ctx = new org.apache.spark.sql.SQLContext(sc)
        import ctx.implicits._

        case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
        case class Person2(name: String, age: Long, city: String)

        val persons = ctx.read.json("/tmp/persons.json").as[Person]
        persons.printSchema()
      }
    }
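
A common cause of this error with code shaped like the above is defining the case classes inside main, where the implicits cannot derive encoders for them. A minimal sketch of the fix, with the case classes moved to the top level (SparkSession is used here instead of the older SQLContext purely for brevity):

    import org.apache.spark.sql.SparkSession

    // Top-level case classes: spark.implicits._ can now derive Encoders for them.
    case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
    case class Person2(name: String, age: Long, city: String)

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
        import spark.implicits._

        val persons = spark.read.json("/tmp/persons.json").as[Person]
        persons.printSchema()
      }
    }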

spark createOrReplaceTempView vs createGlobalTempView

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-17 15:43:24
Question: Spark Dataset 2.0 provides two functions, createOrReplaceTempView and createGlobalTempView. I am not able to understand the basic difference between the two functions. According to the API documents: createOrReplaceTempView: The lifetime of this temporary view is tied to the [[SparkSession]] that was used to create this Dataset. So, when I call sparkSession.close(), the defined view will be destroyed. Is that true? createGlobalTempView: The lifetime of this temporary view is tied to this Spark application
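
A short sketch of the difference, assuming an existing SparkSession named spark: the session-scoped view disappears with the SparkSession that created it, while the global view lives in the reserved global_temp database and is visible from any session of the same application.

    val df = spark.range(5).toDF("id")

    // Session-scoped: visible only through the SparkSession that created it.
    df.createOrReplaceTempView("ids")
    spark.sql("SELECT * FROM ids").show()

    // Application-scoped: registered in the global_temp database and visible
    // from other sessions of the same Spark application.
    df.createGlobalTempView("ids_global")
    spark.newSession().sql("SELECT * FROM global_temp.ids_global").show()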

Encoder error while trying to map dataframe row to updated row

Submitted by [亡魂溺海] on 2019-12-16 22:22:41
Question: When I am trying to do the same thing in my code as mentioned below

    dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })

(I have taken the above reference from here: Scala: How can I replace value in Dataframes using scala), I am getting an encoder error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark
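
Row itself has no implicit Encoder, so one common fix (sketched for Spark 2.x, and assuming the output rows keep the input schema) is to pass a RowEncoder explicitly to map:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // The output rows have the same shape as the input, so reuse the existing schema.
    val updated = dataframe.map { row =>
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    }(RowEncoder(dataframe.schema))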

Spark 2.0 Dataset vs DataFrame

Submitted by 核能气质少年 on 2019-12-16 20:15:32
Question: Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly that myDataSet.map(foo.someVal) is typesafe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)? All the other commands, e.g. select, are just syntactic sugar; they are not typesafe and a map could be used
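
On the first point, a small self-contained sketch (the case class and values are made up): select("foo") and select($"foo") resolve to the same Column, but the $ form composes into expressions, while the typed map is compile-time checked yet opaque to Catalyst.

    import org.apache.spark.sql.SparkSession

    case class Rec(foo: String, someVal: Int)

    val spark = SparkSession.builder.appName("select-vs-map").master("local[*]").getOrCreate()
    import spark.implicits._   // provides the $"..." column syntax

    val ds = Seq(Rec("a", 1), Rec("b", 2)).toDS()

    // Equivalent untyped projections; both return a DataFrame.
    ds.select("foo")
    ds.select($"foo")

    // $"..." is a Column, so it composes into expressions a plain string cannot.
    ds.select($"someVal" + 1)

    // Typed access: checked at compile time, but the lambda is opaque to the optimizer.
    val vals = ds.map(r => r.someVal)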

Getting the Summary of Whole Dataset or Only Columns in Apache Spark Java

Submitted by 白昼怎懂夜的黑 on 2019-12-13 22:29:56
Question: For the Dataset below, to get the total summary values of Col1, I did

    import org.apache.spark.sql.functions._
    val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"),
      sum("price").as("price"), sum("displayPrice").as("displayPrice"))

and then merged it with df:

    df.union(totaldf).orderBy(col("Col1"), col("Col2").desc).show(false)

df:

    +-----------+-------+--------+--------------+
    | Col1      | Col2  | price  | displayPrice |
    +-----------+-------+--------+--------------+
    | Category1 | item1 | 15     | 14           |
    |
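
An alternative sketch that avoids the union, assuming each Col2 value appears only once per Col1: rollup produces the detail rows plus per-Col1 subtotal rows (where Col2 is null), which can then be relabelled as "Total".

    import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

    val withTotals = df.rollup("Col1", "Col2")
      .agg(sum("price").as("price"), sum("displayPrice").as("displayPrice"))
      .where(col("Col1").isNotNull)                              // drop the grand-total row
      .withColumn("Col2", coalesce(col("Col2"), lit("Total")))   // label the subtotal rows
      .orderBy(col("Col1"), col("Col2").desc)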

saveAsTextFile hangs in spark java.io.IOException: Connection reset by peer in Data frame

Submitted by 丶灬走出姿态 on 2019-12-13 19:04:01
Question: I am running an application in Spark which does a simple diff between two data frames. I execute it as a jar file in my cluster environment, which is a 94-node cluster. There are two data sets, 2 GB and 4 GB, which are mapped to data frames. My job works fine for very small files... I personally think saveAsTextFile takes the most time in my application. Below are my cluster config details: Total Vmem allocated for Containers 394.80 GB
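
For reference, a minimal sketch of the diff-then-save pattern described above (the output path and partition count are made up): except() gives the rows in df1 that are not in df2, and repartitioning before the write spreads the output across tasks instead of funnelling it through a few huge partitions.

    // Rows present in df1 but not in df2.
    val diff = df1.except(df2)

    diff.repartition(200)
      .rdd
      .map(_.mkString(","))
      .saveAsTextFile("hdfs:///tmp/df-diff")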