apache-spark-dataset

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

半腔热情 submitted on 2019-11-27 11:08:05
What is the difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts with one single entry point, SparkSession? Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession? Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession? How can I create the following using a SparkSession: RDD, JavaRDD, JavaPairRDD, Dataset? Is there a method to transform a …
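A minimal Scala sketch of how these entry points relate since Spark 2.0, assuming a local master and the illustrative object name ContextsSketch: SparkSession wraps the older contexts, which remain reachable from it.

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

object ContextsSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point since Spark 2.0.
    val spark = SparkSession.builder()
      .appName("contexts-sketch")
      .master("local[*]")
      .getOrCreate()

    // The older entry points are still reachable from the session.
    val sc         = spark.sparkContext        // SparkContext
    val jsc        = new JavaSparkContext(sc)  // JavaSparkContext wraps the same SparkContext
    val sqlContext = spark.sqlContext          // SQLContext, kept for backward compatibility

    // RDDs are still created through the SparkContext ...
    val rdd = sc.parallelize(Seq(1, 2, 3))

    // ... while Datasets are created through the session itself.
    import spark.implicits._
    val ds = spark.createDataset(Seq(1, 2, 3))

    println(s"rdd count = ${rdd.count()}, ds count = ${ds.count()}")
    spark.stop()
  }
}
```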

Encode an ADT / sealed trait hierarchy into Spark DataSet column

我的梦境 submitted on 2019-11-27 09:29:10
If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) within a Spark DataSet column, what is the best encoding strategy? For example, if I have an ADT where the leaf types store different kinds of data: sealed trait Occupation; case object SoftwareEngineer extends Occupation; case class Wizard(level: Int) extends Occupation; case class Other(description: String) extends Occupation. What's the best way to construct an org.apache.spark.sql.DataSet[Occupation]? TL;DR There is no good solution right now, and given the Spark SQL / Dataset implementation, it is unlikely there …
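As a hedged illustration of the situation described above, one commonly cited workaround is a binary (kryo) encoder, which accepts the whole sealed hierarchy but stores the column as an opaque blob that Catalyst cannot inspect. The sketch below assumes local mode and reuses the example Occupation ADT from the question.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

sealed trait Occupation
case object SoftwareEngineer extends Occupation
case class Wizard(level: Int) extends Occupation
case class Other(description: String) extends Occupation

object AdtEncodingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("adt-sketch").getOrCreate()

    // A kryo encoder accepts the whole hierarchy, but the column is stored
    // as an opaque binary blob, so Catalyst cannot optimize or query its fields.
    implicit val occupationEncoder: Encoder[Occupation] = Encoders.kryo[Occupation]
    val data: Seq[Occupation] = Seq(SoftwareEngineer, Wizard(9), Other("plumber"))
    val ds = spark.createDataset(data)

    // Working with the values means deserializing them back into JVM objects.
    val levels = ds.map {
      case Wizard(lvl) => lvl
      case _           => 0
    }(Encoders.scalaInt)
    levels.show()

    spark.stop()
  }
}
```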

Perform a typed join in Scala with Spark Datasets

做~自己de王妃 submitted on 2019-11-27 07:06:17
I like Spark Datasets because they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with the Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset's typed objects than by using an RDD's rows' data fields. However, the join operation is missing from this; I read that I can do a join like this: ds1.joinWith(ds2, ds1.toDF().col("key") === ds2.toDF().col("key"), "inner") But that is not what I want, as …
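One way to keep the join typed, sketched below under the assumption of hypothetical Person/Address/PersonWithCity case classes, is joinWith (which yields a Dataset of pairs) followed by a map into a single result type.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(id: Long, name: String)
case class Address(personId: Long, city: String)
case class PersonWithCity(id: Long, name: String, city: String)

object TypedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-join-sketch").getOrCreate()
    import spark.implicits._

    val people    = Seq(Person(1, "Ada"), Person(2, "Grace")).toDS()
    val addresses = Seq(Address(1, "London"), Address(2, "New York")).toDS()

    // joinWith keeps both sides typed and yields a Dataset of pairs ...
    val joined: Dataset[(Person, Address)] =
      people.joinWith(addresses, people("id") === addresses("personId"), "inner")

    // ... which can then be mapped into a single typed result.
    val result: Dataset[PersonWithCity] =
      joined.map { case (p, a) => PersonWithCity(p.id, p.name, a.city) }

    result.show()
    spark.stop()
  }
}
```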

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

随声附和 submitted on 2019-11-27 04:38:02
Question: I always thought that the Dataset/DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So, I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs: Dataset[Player] = session.read .option("header", "true") .option("delimiter", ",") .option("inferSchema", "true") .csv(PeopleCsv) .as[Player] // Let's try to find players born in 1999. // This will work, you have compile time safety... but it will not …
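A small sketch of the contrast this question is after, assuming a placeholder CSV path: a typed filter takes a Scala lambda that Catalyst cannot look inside, while a Column expression stays visible to the optimizer and can be pushed down when the source supports it (e.g. Parquet).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Player(playerID: String, birthYear: Int)

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder input path, for illustration only.
    val peopleCsv = "/tmp/people.csv"
    val playersDs: Dataset[Player] = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(peopleCsv)
      .as[Player]

    // Typed filter: the lambda is a black box to Catalyst, so the predicate
    // generally cannot be pushed down to the data source.
    playersDs.filter(_.birthYear == 1999).explain()

    // Untyped Column expression: Catalyst sees the predicate and can push it
    // down where the source supports it (shown as PushedFilters in the plan).
    playersDs.filter($"birthYear" === 1999).explain()

    spark.stop()
  }
}
```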

Overwrite only some partitions in a partitioned spark Dataset

三世轮回 submitted on 2019-11-27 01:10:04
Question: How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job and only overwriting the last week of data. The default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written. Answer 1: Since Spark 2.3.0 this is an option when overwriting a table. To overwrite it, you need to set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, …
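A minimal sketch of the dynamic-overwrite approach from the answer, assuming a hypothetical output path and a toy DataFrame standing in for the recomputed week:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionOverwriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("partition-overwrite-sketch").getOrCreate()
    import spark.implicits._

    // Spark 2.3+: only the partitions present in the written data are replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Hypothetical DataFrame holding just the recomputed days of last week.
    val lastWeek = Seq(("2019-11-20", 42), ("2019-11-21", 17)).toDF("day", "value")

    lastWeek.write
      .mode(SaveMode.Overwrite)
      .partitionBy("day")
      .parquet("/tmp/daily_job_output")  // placeholder output path

    spark.stop()
  }
}
```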

How to get keys and values from MapType column in SparkSQL DataFrame

百般思念 submitted on 2019-11-26 20:27:18
Question: I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>. It is read into a data frame in Spark SQL and the schema looks like this: scala> alphaDF.printSchema() root |-- object_id: string (nullable = true) |-- ALPHA: map (nullable = true) | |-- key: string | |-- value: struct (valueContainsNull = true) I am using Spark 2.0 and I am trying to create a new data frame in which the columns need to be object_id plus the keys of the ALPHA map, as in object_id, key1, key2, key2, …
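One possible approach, sketched with a simplified in-memory stand-in for the parquet data (string values instead of the struct from the question): explode the map to collect the distinct keys, then turn each key into a column with getItem. This assumes the key set is small enough to collect to the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object MapColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-columns-sketch").getOrCreate()
    import spark.implicits._

    // Simplified stand-in for the parquet data described in the question.
    val alphaDF = Seq(
      ("obj1", Map("key1" -> "a", "key2" -> "b")),
      ("obj2", Map("key1" -> "c", "key3" -> "d"))
    ).toDF("object_id", "ALPHA")

    // Explode the map to discover the distinct keys ...
    val keys = alphaDF
      .select(explode($"ALPHA"))   // yields columns named "key" and "value"
      .select($"key")
      .distinct()
      .as[String]
      .collect()

    // ... then turn each key into its own column via getItem.
    val keyCols = keys.map(k => $"ALPHA".getItem(k).as(k))
    alphaDF.select(($"object_id" +: keyCols): _*).show()

    spark.stop()
  }
}
```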

What is the difference between Spark DataSet and RDD

三世轮回 submitted on 2019-11-26 18:24:36
Question: I'm still struggling to understand the full power of the recently introduced Spark Datasets. Are there best practices for when to use RDDs and when to use Datasets? In their announcement, Databricks explains that staggering reductions in both runtime and memory can be achieved by using Datasets. Still, it is claimed that Datasets are designed "to work alongside the existing RDD API". Is this just a reference to downward compatibility, or are there scenarios where one would prefer to use RDDs …
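A brief, hedged contrast of the two APIs on a toy word count, assuming local mode: the RDD version is purely functional and opaque to the optimizer, while the Dataset version is declarative and goes through Catalyst.

```scala
import org.apache.spark.sql.SparkSession

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-ds-sketch").getOrCreate()
    import spark.implicits._

    val words = Seq("spark", "dataset", "rdd", "spark")

    // RDD API: arbitrary JVM objects and functional transformations,
    // no schema and no Catalyst optimization.
    val rddCounts = spark.sparkContext
      .parallelize(words)
      .map(w => (w, 1L))
      .reduceByKey(_ + _)

    // Dataset API: schema-aware and declarative, optimized by Catalyst/Tungsten.
    val dsCounts = words.toDS()
      .groupBy($"value")   // a Dataset[String] column is named "value"
      .count()

    rddCounts.collect().foreach(println)
    dsCounts.show()
    spark.stop()
  }
}
```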

How to create a Dataset of Maps?

为君一笑 submitted on 2019-11-26 16:56:16
Question: I'm using Spark 2.2 and am running into trouble when attempting to call spark.createDataset on a Seq of Map. Code and output from my Spark Shell session follow: // createDataset on Seq[T] where T = Int works scala> spark.createDataset(Seq(1, 2, 3)).collect res0: Array[Int] = Array(1, 2, 3) scala> spark.createDataset(Seq(Map(1 -> 2))).collect <console>:24: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are …
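A hedged sketch of one way around the error on Spark 2.2: supply an ExpressionEncoder for the map type explicitly (on Spark 2.3+ an implicit map encoder ships with spark.implicits._, so this workaround should not be needed there).

```scala
import org.apache.spark.sql.{Encoder, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object DatasetOfMapsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-dataset-sketch").getOrCreate()

    // On Spark 2.3+ `import spark.implicits._` alone should be enough,
    // because an implicit encoder for maps was added there.
    // On Spark 2.2 that implicit is missing, so one workaround is to
    // supply an ExpressionEncoder for the map type explicitly.
    implicit val mapEncoder: Encoder[Map[Int, Int]] = ExpressionEncoder[Map[Int, Int]]()

    val ds = spark.createDataset(Seq(Map(1 -> 2)))
    ds.show()

    spark.stop()
  }
}
```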

Difference between DataFrame, Dataset, and RDD in Spark

风格不统一 submitted on 2019-11-26 14:47:34
I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark? Can you convert one to the other? Justin Pihony: A DataFrame is defined well by a Google search for "DataFrame definition": A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query. An RDD, on the other hand, …
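A short sketch of converting between the two, assuming local mode and toy data: df.rdd goes from a DataFrame to an RDD[Row], and toDF (via spark.implicits._) goes back.

```scala
import org.apache.spark.sql.SparkSession

object RddDataFrameConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("conversion-sketch").getOrCreate()
    import spark.implicits._

    // DataFrame -> RDD[Row]
    val df = Seq(("Ada", 36), ("Grace", 45)).toDF("name", "age")
    val rowRdd = df.rdd

    // RDD -> DataFrame, via toDF on an RDD of tuples (or case classes)
    val rdd = spark.sparkContext.parallelize(Seq(("Linus", 29)))
    val backToDf = rdd.toDF("name", "age")

    println(s"rows in RDD form: ${rowRdd.count()}")
    backToDf.show()
    spark.stop()
  }
}
```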