apache-spark-dataset

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

半腔热情 submitted on 2019-11-27 11:08:05
What is the difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts with one single entry point, SparkSession? Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession? Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession? How can I create the following using a SparkSession: RDD, JavaRDD, JavaPairRDD, Dataset? Is there a method to transform a …
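A minimal Scala sketch of how these entry points relate since Spark 2.0, assuming a local master and the illustrative object name ContextsSketch: SparkSession wraps the older contexts, which remain reachable from it.

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

object ContextsSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point since Spark 2.0.
    val spark = SparkSession.builder()
      .appName("contexts-sketch")
      .master("local[*]")
      .getOrCreate()

    // The older entry points are still reachable from the session.
    val sc         = spark.sparkContext        // SparkContext
    val jsc        = new JavaSparkContext(sc)  // JavaSparkContext wraps the same SparkContext
    val sqlContext = spark.sqlContext          // SQLContext, kept for backward compatibility

    // RDDs are still created through the SparkContext ...
    val rdd = sc.parallelize(Seq(1, 2, 3))

    // ... while Datasets are created through the session itself.
    import spark.implicits._
    val ds = spark.createDataset(Seq(1, 2, 3))

    println(s"rdd count = ${rdd.count()}, ds count = ${ds.count()}")
    spark.stop()
  }
}
```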

Encode an ADT / sealed trait hierarchy into Spark DataSet column

我的梦境 submitted on 2019-11-27 09:29:10
If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) within a Spark DataSet column, what is the best encoding strategy? For example, if I have an ADT where the leaf types store different kinds of data: sealed trait Occupation; case object SoftwareEngineer extends Occupation; case class Wizard(level: Int) extends Occupation; case class Other(description: String) extends Occupation. What's the best way to construct an org.apache.spark.sql.DataSet[Occupation]? TL;DR There is no good solution right now, and given the Spark SQL / Dataset implementation, it is unlikely there …
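As a hedged illustration of the situation described above, one commonly cited workaround is a binary (kryo) encoder, which accepts the whole sealed hierarchy but stores the column as an opaque blob that Catalyst cannot inspect. The sketch below assumes local mode and reuses the example Occupation ADT from the question.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

sealed trait Occupation
case object SoftwareEngineer extends Occupation
case class Wizard(level: Int) extends Occupation
case class Other(description: String) extends Occupation

object AdtEncodingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("adt-sketch").getOrCreate()

    // A kryo encoder accepts the whole hierarchy, but the column is stored
    // as an opaque binary blob, so Catalyst cannot optimize or query its fields.
    implicit val occupationEncoder: Encoder[Occupation] = Encoders.kryo[Occupation]
    val data: Seq[Occupation] = Seq(SoftwareEngineer, Wizard(9), Other("plumber"))
    val ds = spark.createDataset(data)

    // Working with the values means deserializing them back into JVM objects.
    val levels = ds.map {
      case Wizard(lvl) => lvl
      case _           => 0
    }(Encoders.scalaInt)
    levels.show()

    spark.stop()
  }
}
```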

Perform a typed join in Scala with Spark Datasets

做~自己de王妃 submitted on 2019-11-27 07:06:17
I like Spark Datasets because they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with the Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset's typed objects than by using an RDD's rows' data fields. However, the join operation is missing from this; I read that I can do a join like this: ds1.joinWith(ds2, ds1.toDF().col("key") === ds2.toDF().col("key"), "inner") But that is not what I want, as …
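One way to keep the join typed, sketched below under the assumption of hypothetical Person/Address/PersonWithCity case classes, is joinWith (which yields a Dataset of pairs) followed by a map into a single result type.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(id: Long, name: String)
case class Address(personId: Long, city: String)
case class PersonWithCity(id: Long, name: String, city: String)

object TypedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-join-sketch").getOrCreate()
    import spark.implicits._

    val people    = Seq(Person(1, "Ada"), Person(2, "Grace")).toDS()
    val addresses = Seq(Address(1, "London"), Address(2, "New York")).toDS()

    // joinWith keeps both sides typed and yields a Dataset of pairs ...
    val joined: Dataset[(Person, Address)] =
      people.joinWith(addresses, people("id") === addresses("personId"), "inner")

    // ... which can then be mapped into a single typed result.
    val result: Dataset[PersonWithCity] =
      joined.map { case (p, a) => PersonWithCity(p.id, p.name, a.city) }

    result.show()
    spark.stop()
  }
}
```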

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

随声附和 submitted on 2019-11-27 04:38:02
Question: I always thought that the Dataset/DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So, I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs: Dataset[Player] = session.read .option("header", "true") .option("delimiter", ",") .option("inferSchema", "true") .csv(PeopleCsv) .as[Player] // Let's try to find players born in 1999. // This will work, you have compile time safety... but it will not …
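A small sketch of the contrast this question is after, assuming a placeholder CSV path: a typed filter takes a Scala lambda that Catalyst cannot look inside, while a Column expression stays visible to the optimizer and can be pushed down when the source supports it (e.g. Parquet).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Player(playerID: String, birthYear: Int)

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder input path, for illustration only.
    val peopleCsv = "/tmp/people.csv"
    val playersDs: Dataset[Player] = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(peopleCsv)
      .as[Player]

    // Typed filter: the lambda is a black box to Catalyst, so the predicate
    // generally cannot be pushed down to the data source.
    playersDs.filter(_.birthYear == 1999).explain()

    // Untyped Column expression: Catalyst sees the predicate and can push it
    // down where the source supports it (shown as PushedFilters in the plan).
    playersDs.filter($"birthYear" === 1999).explain()

    spark.stop()
  }
}
```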

Overwrite only some partitions in a partitioned spark Dataset

三世轮回 submitted on 2019-11-27 01:10:04
Question: How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job and only overwriting the last week of data. The default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written. Answer 1: Since Spark 2.3.0 this is an option when overwriting a table. To overwrite it, you need to set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, …
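A minimal sketch of the dynamic-overwrite approach from the answer, assuming a hypothetical output path and a toy DataFrame standing in for the recomputed week:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionOverwriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("partition-overwrite-sketch").getOrCreate()
    import spark.implicits._

    // Spark 2.3+: only the partitions present in the written data are replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Hypothetical DataFrame holding just the recomputed days of last week.
    val lastWeek = Seq(("2019-11-20", 42), ("2019-11-21", 17)).toDF("day", "value")

    lastWeek.write
      .mode(SaveMode.Overwrite)
      .partitionBy("day")
      .parquet("/tmp/daily_job_output")  // placeholder output path

    spark.stop()
  }
}
```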

How to get keys and values from MapType column in SparkSQL DataFrame

百般思念 submitted on 2019-11-26 20:27:18
Question: I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>. It is read into a data frame in Spark SQL and the schema looks like this: scala> alphaDF.printSchema() root |-- object_id: string (nullable = true) |-- ALPHA: map (nullable = true) | |-- key: string | |-- value: struct (valueContainsNull = true) I am using Spark 2.0 and I am trying to create a new data frame in which the columns need to be object_id plus the keys of the ALPHA map, as in object_id, key1, key2, key2, …
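One possible approach, sketched with a simplified in-memory stand-in for the parquet data (string values instead of the struct from the question): explode the map to collect the distinct keys, then turn each key into a column with getItem. This assumes the key set is small enough to collect to the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object MapColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-columns-sketch").getOrCreate()
    import spark.implicits._

    // Simplified stand-in for the parquet data described in the question.
    val alphaDF = Seq(
      ("obj1", Map("key1" -> "a", "key2" -> "b")),
      ("obj2", Map("key1" -> "c", "key3" -> "d"))
    ).toDF("object_id", "ALPHA")

    // Explode the map to discover the distinct keys ...
    val keys = alphaDF
      .select(explode($"ALPHA"))   // yields columns named "key" and "value"
      .select($"key")
      .distinct()
      .as[String]
      .collect()

    // ... then turn each key into its own column via getItem.
    val keyCols = keys.map(k => $"ALPHA".getItem(k).as(k))
    alphaDF.select(($"object_id" +: keyCols): _*).show()

    spark.stop()
  }
}
```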

What is the difference between Spark DataSet and RDD

三世轮回 submitted on 2019-11-26 18:24:36
Question: I'm still struggling to understand the full power of the recently introduced Spark Datasets. Are there best practices for when to use RDDs and when to use Datasets? In their announcement, Databricks explains that staggering reductions in both runtime and memory can be achieved by using Datasets. Still, it is claimed that Datasets are designed "to work alongside the existing RDD API". Is this just a reference to downward compatibility, or are there scenarios where one would prefer to use RDDs …
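A brief, hedged contrast of the two APIs on a toy word count, assuming local mode: the RDD version is purely functional and opaque to the optimizer, while the Dataset version is declarative and goes through Catalyst.

```scala
import org.apache.spark.sql.SparkSession

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-ds-sketch").getOrCreate()
    import spark.implicits._

    val words = Seq("spark", "dataset", "rdd", "spark")

    // RDD API: arbitrary JVM objects and functional transformations,
    // no schema and no Catalyst optimization.
    val rddCounts = spark.sparkContext
      .parallelize(words)
      .map(w => (w, 1L))
      .reduceByKey(_ + _)

    // Dataset API: schema-aware and declarative, optimized by Catalyst/Tungsten.
    val dsCounts = words.toDS()
      .groupBy($"value")   // a Dataset[String] column is named "value"
      .count()

    rddCounts.collect().foreach(println)
    dsCounts.show()
    spark.stop()
  }
}
```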

How to create a Dataset of Maps?

为君一笑 submitted on 2019-11-26 16:56:16
Question: I'm using Spark 2.2 and am running into trouble when attempting to call spark.createDataset on a Seq of Map. Code and output from my Spark Shell session follow: // createDataset on Seq[T] where T = Int works scala> spark.createDataset(Seq(1, 2, 3)).collect res0: Array[Int] = Array(1, 2, 3) scala> spark.createDataset(Seq(Map(1 -> 2))).collect <console>:24: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are …
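A hedged sketch of one way around the error on Spark 2.2: supply an ExpressionEncoder for the map type explicitly (on Spark 2.3+ an implicit map encoder ships with spark.implicits._, so this workaround should not be needed there).

```scala
import org.apache.spark.sql.{Encoder, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object DatasetOfMapsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-dataset-sketch").getOrCreate()

    // On Spark 2.3+ `import spark.implicits._` alone should be enough,
    // because an implicit encoder for maps was added there.
    // On Spark 2.2 that implicit is missing, so one workaround is to
    // supply an ExpressionEncoder for the map type explicitly.
    implicit val mapEncoder: Encoder[Map[Int, Int]] = ExpressionEncoder[Map[Int, Int]]()

    val ds = spark.createDataset(Seq(Map(1 -> 2)))
    ds.show()

    spark.stop()
  }
}
```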

Difference between DataFrame, Dataset, and RDD in Spark

风格不统一 submitted on 2019-11-26 14:47:34
I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark? Can you convert one to the other? Justin Pihony: A DataFrame is defined well by a Google search for "DataFrame definition": A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query. An RDD, on the other hand, …
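A short sketch of converting between the two, assuming local mode and toy data: df.rdd goes from a DataFrame to an RDD[Row], and toDF (via spark.implicits._) goes back.

```scala
import org.apache.spark.sql.SparkSession

object RddDataFrameConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("conversion-sketch").getOrCreate()
    import spark.implicits._

    // DataFrame -> RDD[Row]
    val df = Seq(("Ada", 36), ("Grace", 45)).toDF("name", "age")
    val rowRdd = df.rdd

    // RDD -> DataFrame, via toDF on an RDD of tuples (or case classes)
    val rdd = spark.sparkContext.parallelize(Seq(("Linus", 29)))
    val backToDf = rdd.toDF("name", "age")

    println(s"rows in RDD form: ${rowRdd.count()}")
    backToDf.show()
    spark.stop()
  }
}
```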