scala

Spark Scala Cosine Similarity Matrix

Submitted by 无人久伴 on 2020-12-04 14:09:59

Question: New to Scala (PySpark guy) and trying to calculate cosine similarity between rows (items). I followed this answer to create a sample df as an example: Spark, Scala, DataFrame: create feature vectors

import org.apache.spark.ml.feature.VectorAssembler

val df = sc.parallelize(Seq(
  (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
  (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
  (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)
)).toDF("userID", "category", "frequency")
// Create a sorted array of categories
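The preview breaks off before any similarity is computed. For completeness, here is a minimal sketch of one way to finish the job. It assumes a SparkSession named spark with spark.implicits._ imported; the pivot/assemble step and the brute-force self-join are illustrative choices, not the linked answer's exact code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// One row per user, one column per category, missing frequencies filled with 0.
val pivoted = df.groupBy("userID").pivot("category").sum("frequency").na.fill(0)
val featureCols = pivoted.columns.filter(_ != "userID")

val features = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(pivoted)
  .select("userID", "features")

// Plain cosine: dot(a, b) / (|a| * |b|).
val cosine = udf { (a: Vector, b: Vector) =>
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.toArray.map(x => x * x).sum) *
              math.sqrt(b.toArray.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

// Every unordered pair of users; fine for toy data. At scale, prefer
// RowMatrix.columnSimilarities (which works column-wise, so transpose first).
val sims = features.as("l").crossJoin(features.as("r"))
  .where($"l.userID" < $"r.userID")
  .select($"l.userID", $"r.userID", cosine($"l.features", $"r.features").as("cosSim"))

sims.show()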

Why is the type of an object's member different inside a function?

Submitted by 廉价感情. on 2020-12-04 10:19:49

Question: The code below produces the following result:

as member: nested: AnyRef{def x: Int; def x_=(x$1: Int): Unit}
as local: nested: Object

(Tested with Scala 2.12.12 and Scala 2.12.3.) Can someone explain why?

object Main extends App {
  def getNestedType(m: Any) = {
    import scala.reflect.runtime.currentMirror
    for {
      symbol <- currentMirror.classSymbol(m.getClass).toType.members
      if symbol.isTerm && !symbol.isMethod && !symbol.isModule
    } yield {
      s"${symbol.name.decodedName}: ${symbol.info}"
    }
  }
  object obj { var
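The snippet is cut off mid-definition. A self-contained reconstruction that would produce output of the reported shape might look like the following; the var nested = new { var x = 1 } bodies are a guess inferred from the printed structural type, not the asker's original code (requires scala-reflect on the classpath):

import scala.reflect.runtime.currentMirror

object Main extends App {
  def getNestedType(m: Any) =
    for {
      symbol <- currentMirror.classSymbol(m.getClass).toType.members
      if symbol.isTerm && !symbol.isMethod && !symbol.isModule
    } yield s"${symbol.name.decodedName}: ${symbol.info}"

  // Guessed member version: the inferred type of `nested` is the
  // structural refinement AnyRef{def x: Int; def x_=(x$1: Int): Unit}.
  object obj { var nested = new { var x = 1 } }

  def asLocal() = {
    // Same definition, but local to a method.
    object obj { var nested = new { var x = 1 } }
    getNestedType(obj)
  }

  println("as member: " + getNestedType(obj).mkString(", "))
  println("as local: " + asLocal().mkString(", "))
}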

How to group a large stream into sub-streams

Submitted by 五迷三道 on 2020-12-04 08:56:55

Question: I want to group a large Stream[F, A] into a Stream[F, Stream[F, A]] with at most n elements per inner stream. This is what I did: basically, I pipe chunks into a Queue[F, Queue[F, Chunk[A]]] and then yield the queue elements as the result stream.

implicit class StreamSyntax[F[_], A](s: Stream[F, A])(
    implicit F: Concurrent[F]) {

  def groupedPipe(
      lastQRef: Ref[F, Queue[F, Option[Chunk[A]]]],
      n: Int): Pipe[F, A, Stream[F, A]] = { in =>
    val initQs = Queue.unbounded[F, Option[Queue[F, Option[Chunk[A]]]]].flatMap {
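The hand-rolled queue machinery is truncated above. For comparison, a much shorter route exists in fs2 itself, sketched below with fs2 2.x / cats-effect 2 names. Note the trade-off: chunkN re-chunks the stream, so each inner "stream" is a fully materialized chunk rather than a lazily fed queue, which may or may not matter for your use case:

import cats.effect.{ExitCode, IO, IOApp}
import fs2.Stream

object GroupedDemo extends IOApp {

  // Group a stream into inner streams of at most n elements each.
  def grouped[F[_], A](s: Stream[F, A], n: Int): Stream[F, Stream[F, A]] =
    s.chunkN(n, allowFewer = true).map(Stream.chunk)

  def run(args: List[String]): IO[ExitCode] =
    grouped(Stream.range(0, 10).covary[IO], 3)
      .evalMap(_.compile.toList)             // drain each inner stream to a List
      .evalMap(group => IO(println(group)))  // List(0, 1, 2), List(3, 4, 5), ...
      .compile.drain
      .map(_ => ExitCode.Success)
}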

[Scala Quick Start] (5) Map and Tuple

Submitted by 情到浓时终转凉″ on 2020-12-04 03:53:51

Contents: 1. Basic Map operations  2. Tuples

1. Basic Map operations

Scala's Map, like Java's, is a key -> value data structure.

// Create an immutable Map
scala> val person = Map("xiaoli" -> 15, "xiaofang" -> 18)
person: scala.collection.immutable.Map[String,Int] = Map(xiaoli -> 15, xiaofang -> 18)

scala> person("xiaoli")
res1: Int = 15

// Next, create a mutable Map
scala> val person = scala.collection.mutable.Map("xiaoli" -> 15, "xiaofang" -> 18)
person: scala.collection.mutable.Map[String,Int] = Map(xiaoli -> 15, xiaofang -> 18)

scala> person("xiaofang")
res3: Int = 18

// A Map can also be created like this
scala> val person = Map(("xiaoxiao", 22), ("xiaowang", 30))
person: scala.collection.immutable.Map[String,Int
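The Tuple half of the excerpt is cut off. As a stopgap, here is a minimal REPL sketch of the Tuple basics the table of contents promises (illustrative values, not the original article's):

// A tuple packs a fixed number of values of possibly different types
scala> val pair = ("xiaoli", 15)
pair: (String, Int) = (xiaoli,15)

// Elements are accessed positionally, starting from _1
scala> pair._1
res0: String = xiaoli

scala> pair._2
res1: Int = 15

// Or destructure in one step
scala> val (name, age) = pair
name: String = xiaoli
age: Int = 15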

[Spark Series 3] Spark 3.0.1 AQE (Adaptive Query Execution) Analysis

Submitted by 不羁岁月 on 2020-12-01 19:43:58

AQE overview

According to the Spark configuration docs, AQE existed as early as Spark 1.6. In the Spark 2.x era, Intel's big-data team built a prototype and put it into practice, and in the Spark 3.0 era, Databricks and Intel together contributed the new AQE to the community.

AQE configuration in Spark 3.0.1

- spark.sql.adaptive.enabled (default: false) — whether adaptive query execution is enabled. Analysis: set it to true here to turn AQE on.
- spark.sql.adaptive.coalescePartitions.enabled (default: true) — whether to coalesce contiguous shuffle partitions (merged according to the spark.sql.adaptive.advisoryPartitionSizeInBytes threshold). Analysis: on by default; see Analysis 1.
- spark.sql.adaptive.coalescePartitions.initialPartitionNum (default: none) — the initial number of shuffle partitions before coalescing; defaults to the value of spark.sql.shuffle.partitions. Analysis: see Analysis 2.
- spark.sql.adaptive.coalescePartitions.minPartitionNum (default: none) — the minimum number of shuffle partitions after coalescing; defaults to the default parallelism of the Spark cluster
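To make the table concrete, here is a minimal sketch of setting these options when building a session; the size and partition numbers are illustrative values, not recommendations:

import org.apache.spark.sql.SparkSession

// Enable AQE and shuffle-partition coalescing on a Spark 3.0.1 session.
val spark = SparkSession.builder()
  .appName("aqe-demo")
  .master("local[*]")  // illustrative; use your cluster master
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Advisory target size used when coalescing shuffle partitions.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
  // Start from the usual shuffle partition count; never coalesce below 8.
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")
  .getOrCreate()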