scala

Creating a UDF with a Non-Primitive Data Type and using it in a Spark-SQL Query: Scala

Submitted by 梦想与她 on 2021-02-08 11:00:42
Question: I am creating a function in Scala which I want to use in my Spark-SQL queries. My query works fine in Hive, and also when I run the same query in Spark SQL, but I use the same query in multiple places, so I want to create it as a reusable function/method that I can call whenever it is required. I have created the below function in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) // gives a value such as 01
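
A minimal sketch of one way to make such logic reusable both from the DataFrame API and from spark.sql queries, assuming Spark 2.x and dates stored as dd-MM-yyyy strings; the object and UDF names here are illustrative, not from the original post:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object DateParts {
  // Column-based helper for use with the DataFrame/Dataset API.
  def monthPart(dateColumn: Column): Column =
    month(to_date(from_unixtime(unix_timestamp(dateColumn, "dd-MM-yyyy"))))

  // String-based UDF so the same logic can be called from spark.sql(...) text queries.
  def register(spark: SparkSession): Unit =
    spark.udf.register("month_part", (s: String) =>
      java.time.LocalDate
        .parse(s, java.time.format.DateTimeFormatter.ofPattern("dd-MM-yyyy"))
        .getMonthValue)
}
```

After calling DateParts.register(spark), a query such as SELECT month_part(order_date) FROM orders can reuse the same logic wherever it is needed.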

Kryo vs Encoder vs Java Serialization in Spark?

Submitted by 廉价感情. on 2021-02-08 10:40:35
Question: Which serialization is used for which case? The Spark documentation says it provides two serialization libraries: 1. Java (default) and 2. Kryo. Now where did Encoders come from, and why are they not mentioned in the doc? Also, Databricks says Encoders perform faster for Datasets; what about RDDs, and how do all these map together? In which case should which serializer be used? Answer 1: Encoders are used in Datasets only. Kryo is used internally in Spark. Kryo and Java serialization is
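
A minimal sketch of how the two mechanisms typically come into play, assuming Spark 2.x running locally; the case class and names are illustrative:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Flight(origin: String, dest: String, delay: Int)

object SerializationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-demo")
      .master("local[*]")
      // Kryo affects serialization of plain JVM objects in RDD shuffles and caching.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    // Datasets use Encoders (here derived implicitly for the case class),
    // which serialize rows into Spark's internal binary format.
    val ds = Seq(Flight("JFK", "LAX", 12)).toDS()
    println(Encoders.product[Flight].schema)

    // RDD operations on JVM objects fall back to Java or Kryo serialization.
    val rdd = spark.sparkContext.parallelize(Seq(Flight("SFO", "SEA", -3)))
    println(rdd.count())

    spark.stop()
  }
}
```

In short: Encoders cover Dataset/DataFrame rows, while the spark.serializer setting (Java by default, Kryo if configured) covers plain JVM objects moved around by RDD operations and shuffles.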

Scala: Get every combination of the last 24 months

Submitted by 笑着哭i on 2021-02-08 10:20:17
Question: I'm trying to generate a DataFrame in Spark (but perhaps plain Scala is enough) in which I have every combination of the last 24 months where the second year-month is always > the first year-month. For example, as of writing this it is 1 March 2019, and I'm after something like:

List(
  (2017, 3, 2017, 4),
  (2017, 3, 2017, 5),
  (2017, 3, 2017, 6),
  // ..
  (2017, 3, 2019, 3),
  (2017, 4, 2017, 5),
  // ..
  (2019, 1, 2019, 3),
  (2019, 2, 2019, 3),
)

Answer 1: This is easiest done with pure Scala without
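
A minimal pure-Scala sketch of one way to build those tuples, assuming java.time is acceptable and that the 24-month window ends at the current month; the names are illustrative:

```scala
import java.time.YearMonth

object MonthCombinations {
  // All ordered pairs (earlier, later) drawn from the last n months,
  // where the second year-month is strictly after the first.
  def lastMonthsPairs(n: Int, end: YearMonth = YearMonth.now()): Seq[(Int, Int, Int, Int)] = {
    val months = (0 until n).map(i => end.minusMonths((n - 1 - i).toLong)) // oldest first
    for {
      a <- months
      b <- months
      if b.isAfter(a)
    } yield (a.getYear, a.getMonthValue, b.getYear, b.getMonthValue)
  }

  def main(args: Array[String]): Unit =
    lastMonthsPairs(24).take(5).foreach(println)
}
```

If a DataFrame is needed afterwards, the resulting Seq can be converted with spark.createDataFrame or toDF, but the combination logic itself needs nothing beyond plain Scala.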

How to restrict method parameter to subclass type in Scala

Submitted by 会有一股神秘感。 on 2021-02-08 10:12:49
Question: I have a trait GameStatistics that defines an add() method which takes a parameter and returns the sum of itself and the parameter. Implementations in subclasses should only accept instances of their own type as the parameter (or maybe also subtypes). I would like to use this add method to aggregate lists of GameStatistics using Seq's reduce method. I have not been able to define this in Scala and make it compile. Below is one example that I tried, plus its compile errors. The errors don't make
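
A minimal sketch of one common way to express this constraint, using an F-bounded type parameter; the member names are illustrative and not taken from the original post:

```scala
trait GameStatistics[T <: GameStatistics[T]] {
  // Each concrete subclass can only be combined with values of its own type.
  def add(other: T): T
}

final case class ShooterStats(kills: Int) extends GameStatistics[ShooterStats] {
  def add(other: ShooterStats): ShooterStats = ShooterStats(kills + other.kills)
}

object Aggregation {
  def main(args: Array[String]): Unit = {
    val stats = Seq(ShooterStats(1), ShooterStats(4), ShooterStats(2))
    // reduce works because every element has the same concrete type.
    println(stats.reduce(_ add _)) // ShooterStats(7)
  }
}
```

The trade-off is that a sequence mixing different concrete GameStatistics types can no longer be reduced this way; a type class such as a Semigroup is the usual alternative when that is needed.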

Convert Scala code to PySpark: Word2Vec Scala Transform Routine

Submitted by 泄露秘密 on 2021-02-08 10:01:31
Question: I want to translate the following routine from the class Word2VecModel (https://github.com/apache/spark/blob/branch-2.3/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala) into PySpark:

override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val vectors = wordVectors.getVectors
    .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
    .map(identity) // mapValues doesn't return a serializable map (SI-7005)
  val bVectors = dataset.sparkSession
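
Before translating, it may help to restate in a simplified plain-Scala sketch what the routine computes per row: the broadcast map of word vectors is summed over the sentence's known words and divided by the sentence length. This is an approximation of the library code, not the code itself, and the helper name is illustrative:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object Word2VecTransformSketch {
  // Sum the vectors of the words found in the model's vocabulary and divide by
  // the sentence length; an empty sentence maps to the zero vector.
  def sentenceVector(words: Seq[String],
                     vectors: Map[String, Array[Double]],
                     size: Int): Vector = {
    val sum = new Array[Double](size)
    if (words.nonEmpty) {
      words.foreach { w =>
        vectors.get(w).foreach { v =>
          var i = 0
          while (i < size) { sum(i) += v(i); i += 1 }
        }
      }
      var i = 0
      while (i < size) { sum(i) /= words.size; i += 1 }
    }
    Vectors.dense(sum)
  }
}
```

One common way to reproduce this in PySpark is to collect the model's word vectors (e.g. via getVectors), broadcast them, and apply a UDF that performs this averaging per row.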

Serialize table to nested JSON using Apache Spark

Submitted by 白昼怎懂夜的黑 on 2021-02-08 09:47:09
Question: I have a set of records like the following sample:

|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|

I want to parse it into JSON, and it should look like this:

{ "ACCOUNTNO":10003014,
  "VEHICLE": [
    { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
    { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
  ],
  "ACCOUNTNO":10003015,
  "VEHICLE": [
    { "VEHICLENUMBER":"MH12GZ3392"
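
A minimal sketch of one common approach, assuming a DataFrame with those three columns; the column names follow the sample, everything else is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object NestedJsonDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-json").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (10003014L, "MH43AJ411", 20000000L),
      (10003014L, "MH43AJ411", 20000001L),
      (10003015L, "MH12GZ3392", 20000002L)
    ).toDF("ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID")

    // Group the vehicles per account into an array of structs, then emit JSON strings.
    val nested = df
      .groupBy($"ACCOUNTNO")
      .agg(collect_list(struct($"VEHICLENUMBER", $"CUSTOMERID")).as("VEHICLE"))

    nested.toJSON.show(truncate = false)
    spark.stop()
  }
}
```

Each output row is then a JSON document with one ACCOUNTNO and its VEHICLE array, and the toJSON dataset can be written out with write.text if files are needed.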

Why can't the value method be used outside macros?

Submitted by 走远了吗. on 2021-02-08 09:37:12
Question: The error message

`value` can only be used within a task or setting macro, such as :=, +=, ++=, Def.task, or Def.setting.
val x = version.value
        ^

clearly indicates how to fix the problem, for example by using :=

val x = settingKey[String]("")
x := version.value

The explanation in "sbt uses macros heavily" states: The value method itself is in fact a macro, one that, if you invoke it outside of the context of another macro, will result in a compile time error, the exact error message being... And
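
A minimal build.sbt sketch of the suggested fix, assuming sbt 1.x; the key names myVersionLabel and printVersion are illustrative:

```scala
// build.sbt
val myVersionLabel = settingKey[String]("Version label derived from the project version")

// .value is only legal inside a setting or task macro such as :=
myVersionLabel := "v" + version.value

// For side-effecting work, use a task instead of a setting.
val printVersion = taskKey[Unit]("Print the project version")
printVersion := println(version.value)
```

Outside of such a macro, a setting's value can still be inspected interactively with show version, or computed inside Def.task / Def.setting blocks.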

Spark DataFrame stat throwing Task not serializable

Submitted by 大城市里の小女人 on 2021-02-08 09:24:06
Question: What am I trying to do? (Context) I'm trying to calculate some stats for a DataFrame/Dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See the docs.

import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession

object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
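
A minimal sketch of how approxQuantile is normally used, assuming a directory of parquet flight records with a numeric delay column; the path data/flights and the column name arrDelay are illustrative. Keeping the call on the driver, outside any closure that captures non-serializable fields of an enclosing class, is the usual way to avoid Task not serializable:

```scala
import org.apache.spark.sql.SparkSession

object FlightQuantiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flight-quantiles")
      .master("local[*]")
      .getOrCreate()

    val flights = spark.read.parquet("data/flights") // hypothetical path

    // approxQuantile(column, probabilities, relativeError) runs on the driver
    // and returns an Array[Double] with the requested quantiles.
    val quantiles = flights.stat.approxQuantile("arrDelay", Array(0.25, 0.5, 0.75), 0.01)
    println(quantiles.mkString(", "))

    spark.stop()
  }
}
```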

How to add a Map column to Spark dataset?

Submitted by 做~自己de王妃 on 2021-02-08 09:15:43
Question: I have a Java Map variable, say Map<String, String> singleColMap. I want to add this Map variable to a dataset as a new column value in Spark 2.2 (Java 1.8). I tried the code below but it is not working:

ds.withColumn("cMap", lit(singleColMap).cast(MapType(StringType, StringType)))

Can someone help with this? Answer 1: You can use typedLit, which was introduced in Spark 2.2.0; from the documentation: The difference between this function and lit is that this function can handle parameterized scala
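
A minimal Scala sketch of the typedLit approach, assuming Spark 2.2+; the dataset and map contents are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

object MapColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-column").master("local[*]").getOrCreate()
    import spark.implicits._

    val singleColMap = Map("source" -> "batch", "region" -> "us-east")

    val ds = Seq(("a", 1), ("b", 2)).toDS()

    // typedLit keeps the Map's type, so the new column becomes map<string,string>.
    val withMap = ds.withColumn("cMap", typedLit(singleColMap))
    withMap.printSchema()
    withMap.show(truncate = false)

    spark.stop()
  }
}
```

Unlike lit, typedLit captures the Map's Scala type, so the resulting column is already map<string,string> without an explicit cast.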