scala

Logger is not working inside spark UDF on cluster

Submitted by 陌路散爱 on 2021-02-10 15:54:51
Question: I have placed log.info statements inside my UDF, but the job fails on the cluster; locally it works fine. Here is the snippet:

    def relType = udf((colValue: String, relTypeV: String) => {
      var relValue = "NA"
      val relType = relTypeV.split(",").toList
      val relTypeMap = relType.map { col =>
        val split = col.split(":")
        (split(0), split(1))
      }.toMap
      // val keySet = relTypeMap
      relTypeMap.foreach { x =>
        if ((x._1 != null || colValue != null || x._1.trim() != "" || colValue.trim() != "") && colValue
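
A frequent cause of this failure is that the logger is instantiated on the driver and captured by the UDF closure: most logger implementations are not serializable, so the task fails when Spark ships the closure to the executors (and even when it works, the output lands in the executor logs rather than on the driver console). A minimal sketch of one workaround, assuming Log4j 1.x on the classpath; the holder object name is illustrative:

    import org.apache.log4j.Logger
    import org.apache.spark.sql.functions.udf

    // a @transient lazy val in an object is re-initialized lazily in each
    // executor JVM instead of being serialized from the driver
    object LogHolder extends Serializable {
      @transient lazy val log: Logger = Logger.getLogger(getClass.getName)
    }

    val relType = udf { (colValue: String, relTypeV: String) =>
      LogHolder.log.info(s"relType called with colValue=$colValue")
      // ... the mapping logic from the question ...
      "NA"
    }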

HList foldLeft with tuple as zero

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-10 15:46:48
Question: I'm trying to foldLeft on an HList with an accumulator of type (HL, Int), where HL is an HList. The program below does not compile. However, if I switch to a simpler accumulator of type HL (by just swapping the commented lines with the ones above them), it compiles and works. Wrapping the HList in a tuple breaks the implicit resolution for the LeftFolder. What am I missing?

    package foo.bar

    import shapeless.{:+:, ::, CNil, Coproduct, Generic, HList, HNil, Lazy, Poly2}
    import shapeless.ops.hlist.
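
For comparison, the tuple-accumulator shape does resolve when the Poly2 case is written directly over the pair and the zero is ascribed to HNil rather than left as HNil.type. A minimal sketch, assuming shapeless 2.3.x; the poly name is illustrative:

    import shapeless._

    // the accumulator is a pair: the HList built so far plus a counter
    object collect extends Poly2 {
      implicit def default[Acc <: HList, T] =
        at[(Acc, Int), T] { case ((acc, n), t) => (t :: acc, n + 1) }
    }

    // the (HNil: HNil) ascription matters: a zero typed as HNil.type can
    // derail implicit resolution of the LeftFolder
    val (reversed, count) =
      (1 :: "a" :: true :: HNil).foldLeft((HNil: HNil, 0))(collect)
    // reversed: Boolean :: String :: Int :: HNil, count: 3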

Mutual Authentication in Scala with Akka

Submitted by 只谈情不闲聊 on 2021-02-10 15:14:12
Question: I would like to create a TLS session in Scala using Akka, with mutual authentication between a client and a server. I have created two CA certificates, each of which has to trust the certificate presented by the other party. Could you give me an example of how to implement this? Thank you.

Answer 1: I created a GitHub project which demonstrates mutual authentication with different kinds of clients, including Akka. Please have a look here: https://github.com/Hakky54/mutual-tls-ssl It contains a full example
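
To sketch the shape of the Akka side (a sketch under assumptions, not the linked project's code: Akka HTTP 10.2.x, PKCS12 stores, illustrative file names and password):

    import java.io.FileInputStream
    import java.security.KeyStore
    import javax.net.ssl.{KeyManagerFactory, SSLContext, TrustManagerFactory}
    import akka.http.scaladsl.ConnectionContext

    // build an SSLContext from a keystore (own certificate + key) and a
    // truststore (the CA that signed the other party's certificate)
    def buildSslContext(keyStore: String, trustStore: String, password: Array[Char]): SSLContext = {
      val ks = KeyStore.getInstance("PKCS12")
      ks.load(new FileInputStream(keyStore), password)
      val kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm)
      kmf.init(ks, password)

      val ts = KeyStore.getInstance("PKCS12")
      ts.load(new FileInputStream(trustStore), password)
      val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
      tmf.init(ts)

      val ctx = SSLContext.getInstance("TLS")
      ctx.init(kmf.getKeyManagers, tmf.getTrustManagers, null)
      ctx
    }

    // server side: setNeedClientAuth(true) is what makes the handshake mutual
    val serverContext = ConnectionContext.httpsServer { () =>
      val engine = buildSslContext("server.p12", "server-truststore.p12", "changeit".toCharArray)
        .createSSLEngine()
      engine.setUseClientMode(false)
      engine.setNeedClientAuth(true)
      engine
    }

The client side is symmetric: build its context from the client keystore and a truststore containing the server's CA, and pass it to Http().singleRequest via its connectionContext parameter.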

Spark: subtract dataframes but preserve duplicate values

Submitted by 南笙酒味 on 2021-02-10 14:51:08
Question: Suppose I have two Spark SQL DataFrames A and B. I want to subtract the items in B from the items in A while preserving duplicates from A. I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original DataFrame A. As a conceptual example, if I have two dataframes:

    words     = [the, quick, fox, a, brown, fox]
    stopWords = [the, a]

then I want the output to be, in
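
On Spark 2.4 and later, exceptAll is the duplicate-preserving (multiset) variant of except; on older versions a left_anti join gives the same result for this use case. A short sketch, assuming both DataFrames have the single column value, as produced by toDF on a Seq[String]:

    // Spark 2.4+: multiset difference, keeps A's duplicates
    val result = words.exceptAll(stopWords)

    // pre-2.4: keep every row of A whose value never occurs in B
    val result2 = words.join(stopWords, Seq("value"), "left_anti")

For the example above, both produce [quick, fox, brown, fox].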

In Scala - How to get the day of the week?

Submitted by 谁说我不能喝 on 2021-02-10 14:43:55
Question: Suppose my date format is 21/05/2017; then the output should be SUN. How can I get the day of the week for a given date?

Answer 1:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    val df = DateTimeFormatter.ofPattern("dd/MM/yyyy")
    val dayOfWeek = LocalDate.parse("21/05/2017", df).getDayOfWeek

Answer 2: You can use SimpleDateFormat as illustrated below:

    import java.util.Calendar
    import java.text.SimpleDateFormat

    val now = Calendar.getInstance.getTime
    val date = new SimpleDateFormat("yyyy-MM-dd")
    date
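
To get the three-letter form asked for (SUN), the DayOfWeek from the first answer can be rendered via its display name; a small sketch:

    import java.time.LocalDate
    import java.time.format.{DateTimeFormatter, TextStyle}
    import java.util.Locale

    val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")
    val day = LocalDate.parse("21/05/2017", fmt).getDayOfWeek
    // getDisplayName yields "Sun"; upper-case it to get "SUN"
    val short = day.getDisplayName(TextStyle.SHORT, Locale.ENGLISH).toUpperCase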

How can I configure spark so that it creates “_$folder$” entries in S3?

Submitted by 青春壹個敷衍的年華 on 2021-02-10 14:39:47
Question: When I write my DataFrame to S3 using

    df.write
      .format("parquet")
      .mode("overwrite")
      .partitionBy("year", "month", "day", "hour", "gen", "client")
      .option("compression", "gzip")
      .save("s3://xxxx/yyyy")

I get the following in S3:

    year=2018
    year=2019

but I would like to have this instead:

    year=2018
    year=2018_$folder$
    year=2019
    year=2019_$folder$

The scripts that read from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate
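
For context, the _$folder$ markers are written by the legacy s3n connector (NativeS3FileSystem) and by EMRFS, not by s3a, so on an s3a setup there may be no switch for this at all. One possible workaround is to create the marker objects explicitly after the save; a sketch assuming the AWS SDK for Java v1 is on the classpath, with the bucket/prefix placeholders from the question:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder

    // the markers are plain zero-byte objects named "<partition>_$folder$",
    // so ordinary PutObject calls after the Spark job can recreate them
    val s3 = AmazonS3ClientBuilder.defaultClient()
    Seq("year=2018", "year=2019").foreach { part =>
      s3.putObject("xxxx", s"yyyy/${part}_$$folder$$", "")
    }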

Scala cannot infer

Submitted by ☆樱花仙子☆ on 2021-02-10 14:33:30
Question: I have a very simple snippet of Spark code which worked on Scala 2.11 and stopped compiling after moving to 2.12:

    import spark.implicits._

    val ds = Seq("val").toDF("col1")
    ds.foreachPartition(part => {
      part.foreach(println)
    })

It fails with the error:

    Error:(22, 12) value foreach is not a member of Object
        part.foreach(println)

The workaround is to help the compiler with code like this:

    import spark.implicits._

    val ds = Seq("val").toDF("col1")
    println(ds.getClass)
    ds.foreachPartition((part: Iterator[Row])
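
For context: under 2.12 the lambda is SAM-convertible to both overloads of foreachPartition (Scala's Iterator[Row] => Unit and Java's ForeachPartitionFunction[Row]), so the compiler cannot pick one and the parameter ends up typed as Object. Annotating the parameter selects the Scala overload; a completed sketch, assuming spark is the usual SparkSession:

    import org.apache.spark.sql.Row
    import spark.implicits._

    val ds = Seq("val").toDF("col1")
    // the explicit Iterator[Row] annotation resolves the overload ambiguity
    ds.foreachPartition((part: Iterator[Row]) => part.foreach(println))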

Thread Safety in Scala reflection with type matching

Submitted by 无人久伴 on 2021-02-10 14:16:32
Question: Working with Scala 2.11.12 and JDK 1.8.0_131, I have been able to replicate a thread-safety bug observed in Apache Spark with the following code, in which I repeatedly check from multiple threads whether Option[Int] can be matched via <:< to Option[_]:

    package stuff

    import java.util.concurrent.{Executors, Future}
    import scala.collection.mutable.ListBuffer

    object Main {
      val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe
      import universe._
      def mirror: universe.Mirror
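
A condensed sketch of the same race (executor setup simplified to Futures; thread and iteration counts are illustrative): many concurrent <:< checks that should all trivially be true, run without external synchronization:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import scala.reflect.runtime.universe._

    // every one of these subtype checks should be true
    val checks = Future.traverse((1 to 1000).toList) { _ =>
      Future(typeOf[Option[Int]] <:< typeOf[Option[_]])
    }
    val spurious = Await.result(checks, 1.minute).count(_ == false)
    // on affected 2.11/2.12 versions this is intermittently non-zero; a
    // coarse workaround is to serialize checks with universe.synchronized
    println(s"spurious false results: $spurious")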

Inherit from a class parametrized by an inner type

Submitted by 末鹿安然 on 2021-02-10 14:14:35
Question: I would like to have a class B that inherits from a generic class A parametrized by an inner type of B. Specifically, I would like this (minimized example):

    class A[T]

    class B extends A[T] {
      class T
    }

Written like this, the compiler does not accept it. Is there any way to specify this inheritance relationship (using some different syntax, or some trick)? If not, what would be an official reference documenting that this is not possible? Notes: Yes, I want T to be an inner class. I
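
One common workaround (a sketch, not an official reference): replace the type parameter with an abstract type member, which a subclass may then bind to its own inner class:

    // A exposes the type as a member instead of a parameter
    class A { type T }

    class B extends A {
      class Inner
      type T = Inner // B binds A's member to its own inner class
    }

With a type parameter this cannot work directly, because T is not in scope at the point of the extends clause: the superclass part of B is elaborated before B's body introduces the inner class.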