apache-spark-dataset

Feasibility of Hive to Netezza data export using spark

北城以北 submitted on 2019-12-13 07:08:20
Question: This mail is to discuss a use case my team is working on: exporting metadata and data from a Hive server to an RDBMS. Export to MySQL and Oracle works fine, but export to Netezza fails with this error message:

    17/02/09 16:03:07 INFO DAGScheduler: Job 1 finished: json at RdbmsSandboxExecution.java:80, took 0.433405 s
    17/02/09 16:03:07 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 143 ms on localhost (1/1)
    17/02/09 16:03:07 INFO TaskSchedulerImpl
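The actual Netezza failure is cut off above; for context, the usual Spark-to-RDBMS export goes through the JDBC writer. The sketch below is only an assumption of that pattern, not the poster's code: the driver class, URL, host, credentials, and table names are placeholders, and the Netezza JDBC jar must be on the driver/executor classpath.

    // Hedged sketch of a DataFrame JDBC write to Netezza (all names are placeholders)
    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "nz_user")
    props.setProperty("password", "nz_password")
    props.setProperty("driver", "org.netezza.Driver")   // assumes the Netezza JDBC driver is on the classpath

    val hiveDf = spark.table("my_hive_db.my_table")      // hypothetical Hive source table

    hiveDf.write
      .mode("append")
      .jdbc("jdbc:netezza://nz-host:5480/MYDB", "TARGET_TABLE", props)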

Spark-Java: How to convert Dataset string column of format “yyyy-MM-ddThh:mm:ss.SSS+0000” to timestamp with a format?

岁酱吖の submitted on 2019-12-13 04:17:13
Question: I have a Dataset with one column lastModified of type string in the format "yyyy-MM-ddThh:mm:ss.SSS+0000" (sample data: 2018-08-17T19:58:46.000+0000). I have to add a new column lastModif_mapped of type Timestamp by converting lastModified's value to the format "yyyy-MM-dd hh:mm:ss.SSS". I tried the code below, but the new column gets the value null:

    Dataset<Row> filtered = null;
    filtered = ds1.select(ds1.col("id"), ds1.col("lastmodified"))
        .withColumn("lastModif_mapped",
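A minimal Scala sketch of the usual approach (the question is Java, but the SQL functions are the same): parse the ISO-8601 string with a pattern that quotes the literal T, uses uppercase HH for a 24-hour clock, and matches the "+0000" offset, then render it in the target format. Column names follow the question; exact pattern letters and millisecond handling can differ slightly between Spark 2.x and 3.x.

    import org.apache.spark.sql.functions._

    // parse the string into a real timestamp, then format it as "yyyy-MM-dd HH:mm:ss.SSS"
    val withTs = ds1
      .withColumn("lastModif_mapped",
        to_timestamp(col("lastmodified"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
      .withColumn("lastModif_str",
        date_format(col("lastModif_mapped"), "yyyy-MM-dd HH:mm:ss.SSS"))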

Spark dataframe to nested JSON

為{幸葍}努か submitted on 2019-12-12 22:31:25
Question: I have a dataframe joinDf created by joining the following four dataframes on userId:

    val detailsDf = Seq((123, "first123", "xyz"))
      .toDF("userId", "firstName", "address")

    val emailDf = Seq((123, "abc@gmail.com"),
      (123, "def@gmail.com"))
      .toDF("userId", "email")

    val foodDf = Seq((123, "food2", false, "Italian", 2),
      (123, "food3", true, "American", 3),
      (123, "food1", true, "Mediterranean", 1))
      .toDF("userId", "foodName", "isFavFood", "cuisine", "score")

    val gameDf = Seq((123, "chess", false, 2),
      (123, "football", true
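The question is cut off before the desired JSON shape, but a common way to get nested JSON out of such joins is to aggregate each child dataframe into an array of structs per userId, join the aggregates back to the details, and emit JSON. The sketch below only covers the email and food dataframes (gameDf would follow the same pattern); the output field names are my assumptions.

    import org.apache.spark.sql.functions._

    val emails = emailDf.groupBy("userId")
      .agg(collect_list(col("email")).as("emails"))

    val foods = foodDf.groupBy("userId")
      .agg(collect_list(struct("foodName", "isFavFood", "cuisine", "score")).as("foods"))

    val nested = detailsDf
      .join(emails, Seq("userId"), "left")
      .join(foods, Seq("userId"), "left")

    nested.toJSON.show(false)   // one nested JSON document per user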

Spark dataset encoders: kryo() vs bean()

半世苍凉 submitted on 2019-12-12 16:34:25
Question: While working with datasets in Spark, we need to specify Encoders for serializing and de-serializing objects. We have the option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>). How are they different, and what are the performance implications of using one over the other?
Answer 1: It is always advisable to prefer Kryo serialization over Java serialization, for several reasons. Some of them are below. Kryo serialization is faster than Java serialization. Kryo serialization has a smaller memory footprint
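As a hedged illustration of the difference the question asks about (the class below is hypothetical): Encoders.bean keeps a real columnar schema with named fields, which lets Catalyst prune and optimize, while Encoders.kryo serializes the whole object into a single opaque binary column.

    import scala.beans.BeanProperty
    import org.apache.spark.sql.Encoders

    // A JavaBean-style class: no-arg constructor plus getters/setters via @BeanProperty
    class Person extends Serializable {
      @BeanProperty var name: String = _
      @BeanProperty var age: Int = _
    }

    val beanEncoder = Encoders.bean(classOf[Person])   // schema: name string, age int
    val kryoEncoder = Encoders.kryo(classOf[Person])   // schema: a single binary column

    // e.g. spark.createDataset(Seq(new Person))(beanEncoder).printSchema()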

Spark UDF not working with null values in Double field

荒凉一梦 submitted on 2019-12-12 12:26:50
Question: I'm trying to write a Spark UDF that replaces the null values of a Double field with 0.0. I'm using the Dataset API. Here's the UDF:

    val coalesceToZero = udf((rate: Double) => if (Option(rate).isDefined) rate else 0.0)

This is based on the following function, which I tested to work fine:

    def cz(value: Double): Double = if (Option(value).isDefined) value else 0.0

    cz(null.asInstanceOf[Double])
    cz: (value: Double)Double
    res15: Double = 0.0

But when I use it in Spark in the following manner, the
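A likely explanation (hedged, since the failing usage is cut off above): a Scala Double parameter is a primitive, so Spark treats the UDF input as non-nullable and returns null for null inputs without ever invoking the function. Two common fixes are sketched below, assuming the column is named "rate" and the dataframe is df (both my assumptions).

    import org.apache.spark.sql.functions._

    // 1) accept the boxed java.lang.Double so a null can actually reach the function
    val coalesceToZero = udf((rate: java.lang.Double) =>
      if (rate == null) 0.0 else rate.doubleValue)

    // 2) or avoid the UDF entirely with a built-in
    val fixed = df.withColumn("rate", coalesce(col("rate"), lit(0.0)))
    // df.na.fill(0.0, Seq("rate")) is an equivalent alternative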

Apache Spark RDD substitution

与世无争的帅哥 submitted on 2019-12-12 01:52:44
Question: I'm trying to solve a problem where I have a dataset like this:

    (1, 3)
    (1, 4)
    (1, 7)
    (1, 2) <-
    (2, 7) <-
    (6, 6)
    (3, 7) <-
    (7, 4) <-
    ...

Since (1 -> 2) and (2 -> 7), I would like to replace the pair (2, 7) with (1, 7); similarly, since (3 -> 7) and (7 -> 4), replace (7, 4) with (3, 4). Hence, my dataset becomes:

    (1, 3)
    (1, 4)
    (1, 7)
    (1, 2)
    (1, 7)
    (6, 6)
    (3, 7)
    (3, 4)
    ...

Any idea how to solve or tackle this? Thanks
Answer 1: This problem looks like a transitive closure of a graph, represented in the
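The answer is cut off above, but since it frames the task as the transitive closure of a graph, here is the classic iterative-join sketch of that idea (assuming the pairs are available as edges: RDD[(Int, Int)]); the substitution described in the question would then pick, for each pair, an ancestor found in the closure.

    // Iteratively join the current closure with the original edges until no new pairs appear.
    var tc = edges
    var grew = true
    while (grew) {
      val byDst   = tc.map { case (a, b) => (b, a) }                        // key existing paths by destination
      val newOnes = byDst.join(edges).map { case (_, (a, c)) => (a, c) }    // a -> b and b -> c gives a -> c
      val next    = tc.union(newOnes).distinct().cache()
      grew = next.count() != tc.count()
      tc = next
    }
    // tc now holds every reachable (ancestor, descendant) pair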

Spark excel: reading an excel file with a multi-line header throws an exception: Method threw 'scala.MatchError' exception

拟墨画扇 submitted on 2019-12-11 16:46:12
Question: I'm using spark-excel to read Excel files. The problem is that whenever I use a file with a multi-line header, the QueryExecution of the dataset throws an exception:

    Method threw 'scala.MatchError' exception.
    Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()

The only solution for now is to replace the multi-line header with a one-line header. I also tried to rename the columns in the dataset using withColumnRenamed, but it didn't work. Is there any way to fix this? Here's the
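The code and file layout are cut off above, so this is only a hedged workaround sketch: skip the multi-line header by starting the read below it and supplying clean column names yourself. It assumes a spark-excel version that supports the dataAddress option; the sheet name, cell address, file path, and column names are placeholders.

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!A3")   // start reading below the header rows (placeholder address)
      .option("header", "false")
      .option("inferSchema", "true")
      .load("/path/to/file.xlsx")
      .toDF("col1", "col2", "col3")           // supply one-line column names (placeholders)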

Passing case class into function arguments

时光怂恿深爱的人放手 submitted on 2019-12-11 16:16:05
Question: Sorry for asking a simple question. I want to pass a case class as a function argument and use it further inside the function. So far I have tried this with TypeTag and ClassTag, but for some reason I am unable to use them properly, or maybe I am not looking in the correct place. The use case is something similar to this:

    case class infoData(colA: Int, colB: String)
    case class someOtherData(col1: String, col2: String, col3: Int)

    def readCsv[T:???](path: String, passedCaseClass: ???): Dataset[???]
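The snippet above stops at the signature, but the usual direction for this kind of question is to constrain the type parameter with an Encoder so Spark can map rows into the case class; no explicit case-class argument is then needed. The sketch below is my assumption of such a signature, not the asker's code; the path and column layout are placeholders.

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    def readCsv[T <: Product : Encoder](spark: SparkSession, path: String): Dataset[T] =
      spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
        .as[T]                                  // works when the CSV columns match the case-class fields

    // usage sketch:
    // case class InfoData(colA: Int, colB: String)
    // import spark.implicits._                 // brings the Encoder for case classes into scope
    // val ds = readCsv[InfoData](spark, "/path/to/file.csv")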

Spark-Xml: Array within an Array in Dataframe to generate XML

走远了吗. submitted on 2019-12-11 14:56:14
Question: I have a requirement to generate XML with the structure below:

    <parent>
      <name>parent</name>
      <childs>
        <child>
          <name>child1</name>
        </child>
        <child>
          <name>child1</name>
          <grandchilds>
            <grandchild>
              <name>grand1</name>
            </grandchild>
            <grandchild>
              <name>grand2</name>
            </grandchild>
            <grandchild>
              <name>grand3</name>
            </grandchild>
          </grandchilds>
        </child>
        <child>
          <name>child1</name>
        </child>
      </childs>
    </parent>

As you can see, a parent has child node(s), and a child node may have grandchild node(s). https:
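The question is truncated above, but a common way to drive spark-xml toward this kind of nesting is to aggregate grandchildren into an array of structs per child, then children per parent, and write with rowTag "parent". The sketch below is an assumption of that approach: rawDf, its column names, and the output path are placeholders, and the extra <childs>/<grandchilds> wrapper elements would need one more enclosing struct level than shown here.

    import org.apache.spark.sql.functions._

    val children = rawDf
      .groupBy("parentName", "childName")
      .agg(collect_list(struct(col("grandchildName").as("name"))).as("grandchild"))
      .groupBy("parentName")
      .agg(collect_list(struct(col("childName").as("name"), col("grandchild"))).as("child"))

    children
      .withColumnRenamed("parentName", "name")
      .write
      .format("com.databricks.spark.xml")
      .option("rootTag", "parents")
      .option("rowTag", "parent")
      .save("/path/to/output")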

Dataset.reduce doesn't support shorthand function

最后都变了- submitted on 2019-12-11 10:45:32
Question: I have a simple piece of code:

    test("0153") {
      val c = Seq(1, 8, 4, 2, 7)
      val max = (x: Int, y: Int) => if (x > y) x else y
      c.reduce(max)
    }

It works fine. But when I follow the same approach with Dataset.reduce,

    test("SparkSQLTest") {
      def max(x: Int, y: Int) = if (x > y) x else y
      val spark = SparkSession.builder().master("local").appName("SparkSQLTest").enableHiveSupport().getOrCreate()
      val ds = spark.range(1, 100).map(_.toInt)
      ds.reduce(max) //compiling error: Error:(20, 15) missing argument list for method
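A hedged note on why this happens, plus the usual fixes: in the first test max is a function value (a val), while in the Dataset test it is a method, and Dataset.reduce is overloaded (a Scala (T, T) => T variant and a Java ReduceFunction[T] variant), so the compiler does not eta-expand the method automatically. Either form below works (assumes the same spark session and import spark.implicits._ as in the question).

    val ds = spark.range(1, 100).map(_.toInt)

    def max(x: Int, y: Int): Int = if (x > y) x else y

    ds.reduce(max _)                            // explicit eta-expansion to a function value
    ds.reduce((x, y) => if (x > y) x else y)    // or pass a lambda directly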