apache-spark-dataset

Spark: java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate

给你一囗甜甜゛ submitted on 2019-12-07 11:36:14
Question: I'm writing a Spark application using version 2.1.1. The following code produces the error below when calling a method with a LocalDate parameter: Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate - field (class: "java.time.LocalDate", name: "_2") - root class: "scala.Tuple2" at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602) at org.apache.spark.sql.catalyst
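
Spark 2.1 has no built-in encoder for java.time.LocalDate (native support only arrived in Spark 3.0), so below is a minimal sketch of two common workarounds; the column values and names are made up for illustration.

```scala
import java.time.LocalDate
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Workaround 1: fall back to a binary Kryo encoder for the whole tuple.
// The LocalDate value becomes an opaque binary field rather than a date column.
val tupleEnc: Encoder[(Int, LocalDate)] = Encoders.kryo[(Int, LocalDate)]
val ds1 = spark.createDataset(Seq((1, LocalDate.now())))(tupleEnc)

// Workaround 2: convert to java.sql.Date, which Spark 2.x encodes natively
// and keeps queryable as a proper DateType column.
val ds2 = Seq((1, java.sql.Date.valueOf(LocalDate.now()))).toDS()
ds2.printSchema()
```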

How to save nested or JSON object in spark Dataset with converting to RDD?

北战南征 submitted on 2019-12-06 22:34:37
I am working on Spark code where I have to save multiple column values in an object format and write the result to MongoDB. Given the Dataset:

| A  | A_SRC | Past_A | Past_A_SRC |
|----|-------|--------|------------|
| a1 | s1    | a2     | s2         |

What I have tried: val ds1 = Seq(("1", "2", "3", "4")).toDF("a", "src", "p_a", "p_src") val recordCol = functions.to_json(struct("a", "src", "p_a", "p_src")) as "A" ds1.select(recordCol).show(truncate = false) gives me a result like +-----------------------------------------+ |A | +---------------------------------------
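
For reference, a minimal runnable sketch of the struct + to_json pattern the excerpt is reaching for (column names taken from the snippet above); to_json expects a single struct column rather than a sequence of columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds1 = Seq(("1", "2", "3", "4")).toDF("a", "src", "p_a", "p_src")

// Wrap the four columns in a struct, then serialize the struct to a JSON string.
val packed = ds1.select(to_json(struct($"a", $"src", $"p_a", $"p_src")).as("A"))
packed.show(truncate = false)   // e.g. {"a":"1","src":"2","p_a":"3","p_src":"4"}
```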

How to convert a JavaPairRDD to Dataset?

一世执手 submitted on 2019-12-06 16:46:23
SparkSession.createDataset() only accepts a List, RDD, or Seq - it doesn't support JavaPairRDD. So if I have a JavaPairRDD<String, User> that I want to create a Dataset from, would a viable workaround for the SparkSession.createDataset() limitation be to create a wrapper UserMap class that contains two fields, String and User, and then call spark.createDataset(userMap, Encoders.bean(UserMap.class))? Answer: If you can convert the JavaPairRDD to List<Tuple2<K, V>>, then you can use the createDataset method that takes a List. See the sample code below. JavaPairRDD<String, User> pairRDD = ...; Dataset<Row> df = spark
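
Here is a sketch of the tuple-encoder route written in Scala (the User bean below is a stand-in for the asker's class): a JavaPairRDD is just a wrapper around an RDD[(K, V)], so it can be unwrapped and handed to createDataset together with an explicit tuple encoder, without any intermediate List or wrapper class.

```scala
import scala.beans.BeanProperty
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Stand-in bean-style class; the real User class is assumed to follow bean conventions.
class User(@BeanProperty var name: String, @BeanProperty var age: Int) {
  def this() = this(null, 0)
}

def pairRddToDataset(spark: SparkSession,
                     pairRDD: JavaPairRDD[String, User]): Dataset[(String, User)] = {
  // Unwrap the underlying RDD[(K, V)] and supply an encoder for the tuple.
  val tupleEncoder: Encoder[(String, User)] =
    Encoders.tuple(Encoders.STRING, Encoders.bean(classOf[User]))
  spark.createDataset(pairRDD.rdd)(tupleEncoder)
}
```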

Spark issues reading parquet files

怎甘沉沦 submitted on 2019-12-06 12:25:13
I have 2 Parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (a part file from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (a part file from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema). My problem is that I have, say, 10 columns which come through properly if I read these 2 files separately using Spark. But if I put these files in one folder and try to read them together, the total count is correct (the sum of the rows from the 2 files), yet most of the columns from the 2nd file are null. Only
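
One thing worth trying in this situation (a sketch, assuming both part files sit in the same hypothetical folder): ask Spark to reconcile the footers of all files with the mergeSchema option instead of trusting a single file's schema, and also compare column name casing between the two runs, since a case mismatch can show up as null columns.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Merge the schemas recorded in every Parquet footer rather than sampling one file.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/folder")   // hypothetical folder holding both part files

df.printSchema()
df.count()                      // row count should still be the sum of both files
df.show(truncate = false)
```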

Scala generic encoder for Spark case class

♀尐吖头ヾ submitted on 2019-12-06 03:16:04
Question: How can I get this method to compile? Strangely, Spark's implicits are already imported. def loadDsFromHive[T <: Product](tableName: String, spark: SparkSession): Dataset[T] = { import spark.implicits._ spark.sql(s"SELECT * FROM $tableName").as[T] } This is the error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future
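
A sketch of the usual fix: importing spark.implicits._ cannot derive an Encoder for an abstract T, so the caller has to supply one. Requiring an Encoder as a context bound pushes that obligation to the call site, where the concrete case class is known.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// The caller provides the Encoder[T]; inside the method T stays opaque.
def loadDsFromHive[T <: Product : Encoder](tableName: String, spark: SparkSession): Dataset[T] =
  spark.sql(s"SELECT * FROM $tableName").as[T]

// Call site: a concrete case class plus spark.implicits._ supplies the encoder.
// case class Person(name: String, age: Int)
// import spark.implicits._
// val people = loadDsFromHive[Person]("db.people", spark)
```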

Generic T as Spark Dataset[T] constructor

久未见 submitted on 2019-12-05 15:59:29
In the following snippet, the tryParquet function tries to load a Dataset from a Parquet file if it exists. If not, it computes, persists and returns the Dataset plan that was provided: import scala.util.{Try, Success, Failure} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.Dataset sealed trait CustomRow case class MyRow( id: Int, name: String ) extends CustomRow val ds: Dataset[MyRow] = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "name").as[MyRow] def tryParquet[T <: CustomRow](session: SparkSession, path: String, target: Dataset[T]): Dataset[T] = Try
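
Since the excerpt is cut off, here is a hedged sketch of how such a tryParquet can be written; the key point for the generic version is that .as[T] on the loaded DataFrame needs an Encoder[T], so the sketch takes it as a context bound. CustomRow and MyRow are the types from the snippet above.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

sealed trait CustomRow
case class MyRow(id: Int, name: String) extends CustomRow

def tryParquet[T <: CustomRow : Encoder](session: SparkSession,
                                         path: String,
                                         target: Dataset[T]): Dataset[T] =
  Try(session.read.parquet(path)) match {
    case Success(df) =>
      df.as[T]                    // reuse the previously persisted file
    case Failure(_) =>
      target.write.parquet(path)  // compute and persist the provided plan
      target
  }
```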

Hive partitions, Spark partitions and joins in Spark - how they relate

主宰稳场 submitted on 2019-12-05 13:09:09
Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins. I have 2 external Hive tables, both backed by S3 buckets and partitioned by date; so in each bucket there are keys with the name format date=<yyyy-MM-dd>/<filename>. Question 1: If I read this data into Spark: val table1 = spark.table("table1").as[Table1Row] val table2 = spark.table("table2").as[Table2Row] then how many partitions will the resultant datasets have, respectively? Partitions equal to the number of objects in S3? Question 2: Suppose the two row types have the following
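
A small sketch for probing these questions empirically (table and column names follow the excerpt; this assumes the tables are registered in the metastore): the partition count of the scan is driven by file sizes and split settings such as spark.sql.files.maxPartitionBytes, not by the number of Hive date= partitions, and explain() shows whether the join plans a shuffle.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val table1 = spark.table("table1")
val table2 = spark.table("table2")

// Partition count of the scan: governed by input file sizes / split settings,
// not by how many date=<yyyy-MM-dd> directories exist in S3.
println(table1.rdd.getNumPartitions)
println(table2.rdd.getNumPartitions)

// The physical plan shows whether the join introduces an Exchange (shuffle).
table1.join(table2, Seq("date")).explain()
```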

Spark Dataset unique id performance - row_number vs monotonically_increasing_id

只愿长相守 submitted on 2019-12-05 08:14:45
I want to assign a unique id to my dataset rows. I know that there are two implementation options: First option: import org.apache.spark.sql.expressions.Window; ds.withColumn("id", row_number().over(Window.orderBy("a column"))) Second option: df.withColumn("id", monotonically_increasing_id()) The second option does not produce a sequential id, and that doesn't really matter. What I'm trying to figure out is whether there are any performance issues with these implementations, that is, whether one of these options is very slow compared to the other. Something more meaningful than: "monotonically_increasing_id is very fast over row
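
A sketch that makes the performance difference visible in the physical plans (the data and the ordering column "key" are hypothetical): a Window.orderBy with no partitionBy forces all rows into a single partition before numbering, whereas monotonically_increasing_id is a narrow, per-partition expression with no shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("x", 1), ("y", 2), ("z", 3)).toDF("key", "value")

// Global window: every row is shuffled into one partition before numbering.
val sequential = ds.withColumn("id", row_number().over(Window.orderBy("key")))
sequential.explain()   // look for Exchange SinglePartition followed by Window

// Narrow transformation: ids are unique but not consecutive, no shuffle involved.
val monotonic = ds.withColumn("id", monotonically_increasing_id())
monotonic.explain()
```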

Generic iterator over dataframe (Spark/Scala)

谁说我不能喝 submitted on 2019-12-05 07:00:29
Question: I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column. In the example below I'll use a simple expression where the current value of s is the multiplication of all previous values, so it may seem like this can be done using a UDF or even analytic functions. However, in reality the logic is much more complex. The code below does what is needed: import org.apache.spark.sql.Row import org.apache.spark.sql.types._ import org.apache.spark.sql.catalyst.encoders
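
Since the excerpt is cut off before the code, here is a hedged sketch of the running-product idea it describes, using hypothetical column names ord (sort order) and v (value): force a single sorted partition and carry the accumulator across rows with mapPartitions. This only works while the data fits in one partition; the original question is about the general case.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class In(ord: Int, v: Double)
case class Out(ord: Int, v: Double, s: Double)

val ds = Seq(In(1, 2.0), In(2, 3.0), In(3, 4.0)).toDS()

// One sorted partition, then thread a running product through the row iterator.
val result = ds.repartition(1).sortWithinPartitions($"ord").mapPartitions { rows =>
  var acc = 1.0
  rows.map { r =>
    val out = Out(r.ord, r.v, acc)  // s = product of all previous v values
    acc *= r.v
    out
  }
}
result.show()
```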