apache-spark-dataset

Feasibility of Hive to Netezza data export using spark

北城以北 submitted on 2019-12-13 07:08:20
Question: This mail is to discuss a use case my team is working on: exporting metadata and data from a Hive server to an RDBMS. Export to MySQL and Oracle works fine, but export to Netezza fails with this error message:

    17/02/09 16:03:07 INFO DAGScheduler: Job 1 finished: json at RdbmsSandboxExecution.java:80, took 0.433405 s
    17/02/09 16:03:07 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 143 ms on localhost (1/1)
    17/02/09 16:03:07 INFO TaskSchedulerImpl
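The actual Netezza failure is cut off above; for context, the usual Spark-to-RDBMS export goes through the JDBC writer. The sketch below is only an assumption of that pattern, not the poster's code: the driver class, URL, host, credentials, and table names are placeholders, and the Netezza JDBC jar must be on the driver/executor classpath.

    // Hedged sketch of a DataFrame JDBC write to Netezza (all names are placeholders)
    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "nz_user")
    props.setProperty("password", "nz_password")
    props.setProperty("driver", "org.netezza.Driver")   // assumes the Netezza JDBC driver is on the classpath

    val hiveDf = spark.table("my_hive_db.my_table")      // hypothetical Hive source table

    hiveDf.write
      .mode("append")
      .jdbc("jdbc:netezza://nz-host:5480/MYDB", "TARGET_TABLE", props)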

Spark-Java: How to convert Dataset string column of format “yyyy-MM-ddThh:mm:ss.SSS+0000” to timestamp with a format?

岁酱吖の submitted on 2019-12-13 04:17:13
Question: I have a Dataset with one column lastModified of type string in the format "yyyy-MM-ddThh:mm:ss.SSS+0000" (sample data: 2018-08-17T19:58:46.000+0000). I have to add a new column lastModif_mapped of type Timestamp by converting lastModified's value to the format "yyyy-MM-dd hh:mm:ss.SSS". I tried the code below, but the new column gets the value null:

    Dataset<Row> filtered = null;
    filtered = ds1.select(ds1.col("id"), ds1.col("lastmodified"))
        .withColumn("lastModif_mapped",
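A minimal Scala sketch of the usual approach (the question is Java, but the SQL functions are the same): parse the ISO-8601 string with a pattern that quotes the literal T, uses uppercase HH for a 24-hour clock, and matches the "+0000" offset, then render it in the target format. Column names follow the question; exact pattern letters and millisecond handling can differ slightly between Spark 2.x and 3.x.

    import org.apache.spark.sql.functions._

    // parse the string into a real timestamp, then format it as "yyyy-MM-dd HH:mm:ss.SSS"
    val withTs = ds1
      .withColumn("lastModif_mapped",
        to_timestamp(col("lastmodified"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
      .withColumn("lastModif_str",
        date_format(col("lastModif_mapped"), "yyyy-MM-dd HH:mm:ss.SSS"))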

Spark dataframe to nested JSON

為{幸葍}努か submitted on 2019-12-12 22:31:25
Question: I have a dataframe joinDf created by joining the following four dataframes on userId:

    val detailsDf = Seq((123, "first123", "xyz"))
      .toDF("userId", "firstName", "address")

    val emailDf = Seq((123, "abc@gmail.com"),
      (123, "def@gmail.com"))
      .toDF("userId", "email")

    val foodDf = Seq((123, "food2", false, "Italian", 2),
      (123, "food3", true, "American", 3),
      (123, "food1", true, "Mediterranean", 1))
      .toDF("userId", "foodName", "isFavFood", "cuisine", "score")

    val gameDf = Seq((123, "chess", false, 2),
      (123, "football", true
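The question is cut off before the desired JSON shape, but a common way to get nested JSON out of such joins is to aggregate each child dataframe into an array of structs per userId, join the aggregates back to the details, and emit JSON. The sketch below only covers the email and food dataframes (gameDf would follow the same pattern); the output field names are my assumptions.

    import org.apache.spark.sql.functions._

    val emails = emailDf.groupBy("userId")
      .agg(collect_list(col("email")).as("emails"))

    val foods = foodDf.groupBy("userId")
      .agg(collect_list(struct("foodName", "isFavFood", "cuisine", "score")).as("foods"))

    val nested = detailsDf
      .join(emails, Seq("userId"), "left")
      .join(foods, Seq("userId"), "left")

    nested.toJSON.show(false)   // one nested JSON document per user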

Spark dataset encoders: kryo() vs bean()

半世苍凉 submitted on 2019-12-12 16:34:25
Question: While working with datasets in Spark, we need to specify Encoders for serializing and de-serializing objects. We have the option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>). How are they different, and what are the performance implications of using one over the other?
Answer 1: It is always advisable to prefer Kryo serialization over Java serialization, for several reasons. Some of them are below. Kryo serialization is faster than Java serialization. Kryo serialization has a smaller memory footprint
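As a hedged illustration of the difference the question asks about (the class below is hypothetical): Encoders.bean keeps a real columnar schema with named fields, which lets Catalyst prune and optimize, while Encoders.kryo serializes the whole object into a single opaque binary column.

    import scala.beans.BeanProperty
    import org.apache.spark.sql.Encoders

    // A JavaBean-style class: no-arg constructor plus getters/setters via @BeanProperty
    class Person extends Serializable {
      @BeanProperty var name: String = _
      @BeanProperty var age: Int = _
    }

    val beanEncoder = Encoders.bean(classOf[Person])   // schema: name string, age int
    val kryoEncoder = Encoders.kryo(classOf[Person])   // schema: a single binary column

    // e.g. spark.createDataset(Seq(new Person))(beanEncoder).printSchema()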

Spark UDF not working with null values in Double field

荒凉一梦 submitted on 2019-12-12 12:26:50
Question: I'm trying to write a Spark UDF that replaces the null values of a Double field with 0.0. I'm using the Dataset API. Here's the UDF:

    val coalesceToZero = udf((rate: Double) => if (Option(rate).isDefined) rate else 0.0)

This is based on the following function, which I tested to work fine:

    def cz(value: Double): Double = if (Option(value).isDefined) value else 0.0

    cz(null.asInstanceOf[Double])
    cz: (value: Double)Double
    res15: Double = 0.0

But when I use it in Spark in the following manner, the
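A likely explanation (hedged, since the failing usage is cut off above): a Scala Double parameter is a primitive, so Spark treats the UDF input as non-nullable and returns null for null inputs without ever invoking the function. Two common fixes are sketched below, assuming the column is named "rate" and the dataframe is df (both my assumptions).

    import org.apache.spark.sql.functions._

    // 1) accept the boxed java.lang.Double so a null can actually reach the function
    val coalesceToZero = udf((rate: java.lang.Double) =>
      if (rate == null) 0.0 else rate.doubleValue)

    // 2) or avoid the UDF entirely with a built-in
    val fixed = df.withColumn("rate", coalesce(col("rate"), lit(0.0)))
    // df.na.fill(0.0, Seq("rate")) is an equivalent alternative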

Apache Spark RDD substitution

与世无争的帅哥 submitted on 2019-12-12 01:52:44
Question: I'm trying to solve a problem where I have a dataset like this:

    (1, 3)
    (1, 4)
    (1, 7)
    (1, 2) <-
    (2, 7) <-
    (6, 6)
    (3, 7) <-
    (7, 4) <-
    ...

Since (1 -> 2) and (2 -> 7), I would like to replace the pair (2, 7) with (1, 7); similarly, since (3 -> 7) and (7 -> 4), replace (7, 4) with (3, 4). Hence, my dataset becomes:

    (1, 3)
    (1, 4)
    (1, 7)
    (1, 2)
    (1, 7)
    (6, 6)
    (3, 7)
    (3, 4)
    ...

Any idea how to solve or tackle this? Thanks
Answer 1: This problem looks like a transitive closure of a graph, represented in the
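The answer is cut off above, but since it frames the task as the transitive closure of a graph, here is the classic iterative-join sketch of that idea (assuming the pairs are available as edges: RDD[(Int, Int)]); the substitution described in the question would then pick, for each pair, an ancestor found in the closure.

    // Iteratively join the current closure with the original edges until no new pairs appear.
    var tc = edges
    var grew = true
    while (grew) {
      val byDst   = tc.map { case (a, b) => (b, a) }                        // key existing paths by destination
      val newOnes = byDst.join(edges).map { case (_, (a, c)) => (a, c) }    // a -> b and b -> c gives a -> c
      val next    = tc.union(newOnes).distinct().cache()
      grew = next.count() != tc.count()
      tc = next
    }
    // tc now holds every reachable (ancestor, descendant) pair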

Spark excel: reading an excel file with a multi-line header throws an exception: Method threw 'scala.MatchError' exception

拟墨画扇 submitted on 2019-12-11 16:46:12
Question: I'm using spark-excel to read Excel files. The problem is that whenever I use a file with a multi-line header, the QueryExecution of the dataset throws an exception:

    Method threw 'scala.MatchError' exception.
    Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()

The only solution for now is to replace the multi-line header with a one-line header. I also tried to rename the columns in the dataset using withColumnRenamed, but it didn't work. Is there any way to fix this? Here's the
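The code and file layout are cut off above, so this is only a hedged workaround sketch: skip the multi-line header by starting the read below it and supplying clean column names yourself. It assumes a spark-excel version that supports the dataAddress option; the sheet name, cell address, file path, and column names are placeholders.

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!A3")   // start reading below the header rows (placeholder address)
      .option("header", "false")
      .option("inferSchema", "true")
      .load("/path/to/file.xlsx")
      .toDF("col1", "col2", "col3")           // supply one-line column names (placeholders)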

Passing case class into function arguments

时光怂恿深爱的人放手 submitted on 2019-12-11 16:16:05
Question: Sorry for asking a simple question. I want to pass a case class as a function argument and use it further inside the function. So far I have tried this with TypeTag and ClassTag, but for some reason I am unable to use them properly, or maybe I am not looking in the correct place. The use case is something similar to this:

    case class infoData(colA: Int, colB: String)
    case class someOtherData(col1: String, col2: String, col3: Int)

    def readCsv[T:???](path: String, passedCaseClass: ???): Dataset[???]
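The snippet above stops at the signature, but the usual direction for this kind of question is to constrain the type parameter with an Encoder so Spark can map rows into the case class; no explicit case-class argument is then needed. The sketch below is my assumption of such a signature, not the asker's code; the path and column layout are placeholders.

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    def readCsv[T <: Product : Encoder](spark: SparkSession, path: String): Dataset[T] =
      spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
        .as[T]                                  // works when the CSV columns match the case-class fields

    // usage sketch:
    // case class InfoData(colA: Int, colB: String)
    // import spark.implicits._                 // brings the Encoder for case classes into scope
    // val ds = readCsv[InfoData](spark, "/path/to/file.csv")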

Spark-Xml: Array within an Array in Dataframe to generate XML

走远了吗. submitted on 2019-12-11 14:56:14
Question: I have a requirement to generate XML with the structure below:

    <parent>
      <name>parent</name>
      <childs>
        <child>
          <name>child1</name>
        </child>
        <child>
          <name>child1</name>
          <grandchilds>
            <grandchild>
              <name>grand1</name>
            </grandchild>
            <grandchild>
              <name>grand2</name>
            </grandchild>
            <grandchild>
              <name>grand3</name>
            </grandchild>
          </grandchilds>
        </child>
        <child>
          <name>child1</name>
        </child>
      </childs>
    </parent>

As you can see, a parent has child node(s), and a child node may have grandchild node(s). https:
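The question is truncated above, but a common way to drive spark-xml toward this kind of nesting is to aggregate grandchildren into an array of structs per child, then children per parent, and write with rowTag "parent". The sketch below is an assumption of that approach: rawDf, its column names, and the output path are placeholders, and the extra <childs>/<grandchilds> wrapper elements would need one more enclosing struct level than shown here.

    import org.apache.spark.sql.functions._

    val children = rawDf
      .groupBy("parentName", "childName")
      .agg(collect_list(struct(col("grandchildName").as("name"))).as("grandchild"))
      .groupBy("parentName")
      .agg(collect_list(struct(col("childName").as("name"), col("grandchild"))).as("child"))

    children
      .withColumnRenamed("parentName", "name")
      .write
      .format("com.databricks.spark.xml")
      .option("rootTag", "parents")
      .option("rowTag", "parent")
      .save("/path/to/output")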

Dataset.reduce doesn't support shorthand function

最后都变了- submitted on 2019-12-11 10:45:32
Question: I have a simple piece of code:

    test("0153") {
      val c = Seq(1, 8, 4, 2, 7)
      val max = (x: Int, y: Int) => if (x > y) x else y
      c.reduce(max)
    }

It works fine. But when I follow the same approach with Dataset.reduce,

    test("SparkSQLTest") {
      def max(x: Int, y: Int) = if (x > y) x else y
      val spark = SparkSession.builder().master("local").appName("SparkSQLTest").enableHiveSupport().getOrCreate()
      val ds = spark.range(1, 100).map(_.toInt)
      ds.reduce(max) //compiling error: Error:(20, 15) missing argument list for method
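A hedged note on why this happens, plus the usual fixes: in the first test max is a function value (a val), while in the Dataset test it is a method, and Dataset.reduce is overloaded (a Scala (T, T) => T variant and a Java ReduceFunction[T] variant), so the compiler does not eta-expand the method automatically. Either form below works (assumes the same spark session and import spark.implicits._ as in the question).

    val ds = spark.range(1, 100).map(_.toInt)

    def max(x: Int, y: Int): Int = if (x > y) x else y

    ds.reduce(max _)                            // explicit eta-expansion to a function value
    ds.reduce((x, y) => if (x > y) x else y)    // or pass a lambda directly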