scala

Creating a UDF with a Non-Primitive Data Type and using it in a Spark-SQL Query: Scala

Submitted by 梦想与她 on 2021-02-08 11:00:42
Question: I am creating a function in Scala which I want to use in my Spark-SQL queries. My query works fine in Hive, and also when I run the same query in Spark SQL, but I use the same query in multiple places, so I want to create it as a reusable function/method that I can call whenever it is required. I have created the below function in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) // gives a value such as 01
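
A minimal sketch of one way to make such logic reusable both from the DataFrame API and from spark.sql queries, assuming Spark 2.x and dates stored as dd-MM-yyyy strings; the object and UDF names here are illustrative, not from the original post:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object DateParts {
  // Column-based helper for use with the DataFrame/Dataset API.
  def monthPart(dateColumn: Column): Column =
    month(to_date(from_unixtime(unix_timestamp(dateColumn, "dd-MM-yyyy"))))

  // String-based UDF so the same logic can be called from spark.sql(...) text queries.
  def register(spark: SparkSession): Unit =
    spark.udf.register("month_part", (s: String) =>
      java.time.LocalDate
        .parse(s, java.time.format.DateTimeFormatter.ofPattern("dd-MM-yyyy"))
        .getMonthValue)
}
```

After calling DateParts.register(spark), a query such as SELECT month_part(order_date) FROM orders can reuse the same logic wherever it is needed.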

Kryo vs Encoder vs Java Serialization in Spark?

Submitted by 廉价感情. on 2021-02-08 10:40:35
Question: Which serialization is used for which case? The Spark documentation says it provides two serialization libraries: 1. Java (default) and 2. Kryo. Now where did Encoders come from, and why are they not mentioned in the doc? Also, Databricks says Encoders perform faster for Datasets; what about RDDs, and how do all these map together? In which case should which serializer be used? Answer 1: Encoders are used in Datasets only. Kryo is used internally in Spark. Kryo and Java serialization is
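
A minimal sketch of how the two mechanisms typically come into play, assuming Spark 2.x running locally; the case class and names are illustrative:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Flight(origin: String, dest: String, delay: Int)

object SerializationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-demo")
      .master("local[*]")
      // Kryo affects serialization of plain JVM objects in RDD shuffles and caching.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    // Datasets use Encoders (here derived implicitly for the case class),
    // which serialize rows into Spark's internal binary format.
    val ds = Seq(Flight("JFK", "LAX", 12)).toDS()
    println(Encoders.product[Flight].schema)

    // RDD operations on JVM objects fall back to Java or Kryo serialization.
    val rdd = spark.sparkContext.parallelize(Seq(Flight("SFO", "SEA", -3)))
    println(rdd.count())

    spark.stop()
  }
}
```

In short: Encoders cover Dataset/DataFrame rows, while the spark.serializer setting (Java by default, Kryo if configured) covers plain JVM objects moved around by RDD operations and shuffles.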

Scala: Get every combination of the last 24 months

Submitted by 笑着哭i on 2021-02-08 10:20:17
Question: I'm trying to generate a DataFrame in Spark (but perhaps plain Scala is enough) in which I have every combination of the last 24 months where the second year-month is always > the first year-month. For example, as of writing this it is 1 March 2019, and I'm after something like:

List(
  (2017, 3, 2017, 4),
  (2017, 3, 2017, 5),
  (2017, 3, 2017, 6),
  // ..
  (2017, 3, 2019, 3),
  (2017, 4, 2017, 5),
  // ..
  (2019, 1, 2019, 3),
  (2019, 2, 2019, 3),
)

Answer 1: This is easiest done with pure Scala without
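
A minimal pure-Scala sketch of one way to build those tuples, assuming java.time is acceptable and that the 24-month window ends at the current month; the names are illustrative:

```scala
import java.time.YearMonth

object MonthCombinations {
  // All ordered pairs (earlier, later) drawn from the last n months,
  // where the second year-month is strictly after the first.
  def lastMonthsPairs(n: Int, end: YearMonth = YearMonth.now()): Seq[(Int, Int, Int, Int)] = {
    val months = (0 until n).map(i => end.minusMonths((n - 1 - i).toLong)) // oldest first
    for {
      a <- months
      b <- months
      if b.isAfter(a)
    } yield (a.getYear, a.getMonthValue, b.getYear, b.getMonthValue)
  }

  def main(args: Array[String]): Unit =
    lastMonthsPairs(24).take(5).foreach(println)
}
```

If a DataFrame is needed afterwards, the resulting Seq can be converted with spark.createDataFrame or toDF, but the combination logic itself needs nothing beyond plain Scala.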

How to restrict method parameter to subclass type in Scala

Submitted by 会有一股神秘感。 on 2021-02-08 10:12:49
Question: I have a trait GameStatistics that defines an add() method which takes a parameter and returns the sum of itself and the parameter. Implementations in subclasses should only accept instances of their own type as the parameter (or maybe also subtypes). I would like to use this add method to aggregate lists of GameStatistics using Seq's reduce method. I have not been able to define this in Scala and make it compile. Below is one example that I tried, plus its compile errors. The errors don't make
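
A minimal sketch of one common way to express this constraint, using an F-bounded type parameter; the member names are illustrative and not taken from the original post:

```scala
trait GameStatistics[T <: GameStatistics[T]] {
  // Each concrete subclass can only be combined with values of its own type.
  def add(other: T): T
}

final case class ShooterStats(kills: Int) extends GameStatistics[ShooterStats] {
  def add(other: ShooterStats): ShooterStats = ShooterStats(kills + other.kills)
}

object Aggregation {
  def main(args: Array[String]): Unit = {
    val stats = Seq(ShooterStats(1), ShooterStats(4), ShooterStats(2))
    // reduce works because every element has the same concrete type.
    println(stats.reduce(_ add _)) // ShooterStats(7)
  }
}
```

The trade-off is that a sequence mixing different concrete GameStatistics types can no longer be reduced this way; a type class such as a Semigroup is the usual alternative when that is needed.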

Convert Scala code to PySpark: Word2Vec Scala Transform Routine

Submitted by 泄露秘密 on 2021-02-08 10:01:31
Question: I want to translate the following routine from the class Word2VecModel (https://github.com/apache/spark/blob/branch-2.3/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala) into PySpark:

override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val vectors = wordVectors.getVectors
    .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
    .map(identity) // mapValues doesn't return a serializable map (SI-7005)
  val bVectors = dataset.sparkSession
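
Before translating, it may help to restate in a simplified plain-Scala sketch what the routine computes per row: the broadcast map of word vectors is summed over the sentence's known words and divided by the sentence length. This is an approximation of the library code, not the code itself, and the helper name is illustrative:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object Word2VecTransformSketch {
  // Sum the vectors of the words found in the model's vocabulary and divide by
  // the sentence length; an empty sentence maps to the zero vector.
  def sentenceVector(words: Seq[String],
                     vectors: Map[String, Array[Double]],
                     size: Int): Vector = {
    val sum = new Array[Double](size)
    if (words.nonEmpty) {
      words.foreach { w =>
        vectors.get(w).foreach { v =>
          var i = 0
          while (i < size) { sum(i) += v(i); i += 1 }
        }
      }
      var i = 0
      while (i < size) { sum(i) /= words.size; i += 1 }
    }
    Vectors.dense(sum)
  }
}
```

One common way to reproduce this in PySpark is to collect the model's word vectors (e.g. via getVectors), broadcast them, and apply a UDF that performs this averaging per row.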

Serialize table to nested JSON using Apache Spark

Submitted by 白昼怎懂夜的黑 on 2021-02-08 09:47:09
Question: I have a set of records like the following sample:

|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|

I want to parse it into JSON, and it should look like this:

{ "ACCOUNTNO":10003014,
  "VEHICLE": [
    { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
    { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
  ],
  "ACCOUNTNO":10003015,
  "VEHICLE": [
    { "VEHICLENUMBER":"MH12GZ3392"
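
A minimal sketch of one common approach, assuming a DataFrame with those three columns; the column names follow the sample, everything else is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object NestedJsonDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-json").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (10003014L, "MH43AJ411", 20000000L),
      (10003014L, "MH43AJ411", 20000001L),
      (10003015L, "MH12GZ3392", 20000002L)
    ).toDF("ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID")

    // Group the vehicles per account into an array of structs, then emit JSON strings.
    val nested = df
      .groupBy($"ACCOUNTNO")
      .agg(collect_list(struct($"VEHICLENUMBER", $"CUSTOMERID")).as("VEHICLE"))

    nested.toJSON.show(truncate = false)
    spark.stop()
  }
}
```

Each output row is then a JSON document with one ACCOUNTNO and its VEHICLE array, and the toJSON dataset can be written out with write.text if files are needed.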

Why can't the value method be used outside macros?

Submitted by 走远了吗. on 2021-02-08 09:37:12
Question: The error message

`value` can only be used within a task or setting macro, such as :=, +=, ++=, Def.task, or Def.setting.
val x = version.value
        ^

clearly indicates how to fix the problem, for example by using :=

val x = settingKey[String]("")
x := version.value

The explanation in "sbt uses macros heavily" states: The value method itself is in fact a macro, one that, if you invoke it outside of the context of another macro, will result in a compile time error, the exact error message being... And
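
A minimal build.sbt sketch of the suggested fix, assuming sbt 1.x; the key names myVersionLabel and printVersion are illustrative:

```scala
// build.sbt
val myVersionLabel = settingKey[String]("Version label derived from the project version")

// .value is only legal inside a setting or task macro such as :=
myVersionLabel := "v" + version.value

// For side-effecting work, use a task instead of a setting.
val printVersion = taskKey[Unit]("Print the project version")
printVersion := println(version.value)
```

Outside of such a macro, a setting's value can still be inspected interactively with show version, or computed inside Def.task / Def.setting blocks.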

Spark DataFrame stat throwing Task not serializable

Submitted by 大城市里の小女人 on 2021-02-08 09:24:06
Question: What am I trying to do? (Context) I'm trying to calculate some stats for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See the docs.

import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession

object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
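
A minimal sketch of how approxQuantile is normally used, assuming a directory of parquet flight records with a numeric delay column; the path data/flights and the column name arrDelay are illustrative. Keeping the call on the driver, outside any closure that captures non-serializable fields of an enclosing class, is the usual way to avoid Task not serializable:

```scala
import org.apache.spark.sql.SparkSession

object FlightQuantiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flight-quantiles")
      .master("local[*]")
      .getOrCreate()

    val flights = spark.read.parquet("data/flights") // hypothetical path

    // approxQuantile(column, probabilities, relativeError) runs on the driver
    // and returns an Array[Double] with the requested quantiles.
    val quantiles = flights.stat.approxQuantile("arrDelay", Array(0.25, 0.5, 0.75), 0.01)
    println(quantiles.mkString(", "))

    spark.stop()
  }
}
```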

How to add a Map column to Spark dataset?

Submitted by 做~自己de王妃 on 2021-02-08 09:15:43
Question: I have a Java Map variable, say Map<String, String> singleColMap. I want to add this Map variable to a dataset as a new column value in Spark 2.2 (Java 1.8). I tried the code below but it is not working:

ds.withColumn("cMap", lit(singleColMap).cast(MapType(StringType, StringType)))

Can someone help with this? Answer 1: You can use typedLit, which was introduced in Spark 2.2.0; from the documentation: The difference between this function and lit is that this function can handle parameterized scala
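
A minimal Scala sketch of the typedLit approach, assuming Spark 2.2+; the dataset and map contents are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

object MapColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-column").master("local[*]").getOrCreate()
    import spark.implicits._

    val singleColMap = Map("source" -> "batch", "region" -> "us-east")

    val ds = Seq(("a", 1), ("b", 2)).toDS()

    // typedLit keeps the Map's type, so the new column becomes map<string,string>.
    val withMap = ds.withColumn("cMap", typedLit(singleColMap))
    withMap.printSchema()
    withMap.show(truncate = false)

    spark.stop()
  }
}
```

Unlike lit, typedLit captures the Map's Scala type, so the resulting column is already map<string,string> without an explicit cast.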