scala

Spark Scala Cosine Similarity Matrix

Submitted by 无人久伴 on 2020-12-04 14:09:59

Question: New to Scala (PySpark guy) and trying to calculate cosine similarity between rows (items). I followed this answer to create a sample df as an example: Spark, Scala, DataFrame: create feature vectors

import org.apache.spark.ml.feature.VectorAssembler

val df = sc.parallelize(Seq(
  (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
  (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
  (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)
)).toDF("userID", "category", "frequency")
// Create a sorted array of categories
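The preview breaks off before any similarity is computed. For completeness, here is a minimal sketch of one way to finish the job. It assumes a SparkSession named spark with spark.implicits._ imported; the pivot/assemble step and the brute-force self-join are illustrative choices, not the linked answer's exact code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// One row per user, one column per category, missing frequencies filled with 0.
val pivoted = df.groupBy("userID").pivot("category").sum("frequency").na.fill(0)
val featureCols = pivoted.columns.filter(_ != "userID")

val features = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(pivoted)
  .select("userID", "features")

// Plain cosine: dot(a, b) / (|a| * |b|).
val cosine = udf { (a: Vector, b: Vector) =>
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.toArray.map(x => x * x).sum) *
              math.sqrt(b.toArray.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

// Every unordered pair of users; fine for toy data. At scale, prefer
// RowMatrix.columnSimilarities (which works column-wise, so transpose first).
val sims = features.as("l").crossJoin(features.as("r"))
  .where($"l.userID" < $"r.userID")
  .select($"l.userID", $"r.userID", cosine($"l.features", $"r.features").as("cosSim"))

sims.show()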

Why is the type of an object's member different inside a function?

Submitted by 廉价感情. on 2020-12-04 10:19:49

Question: The code below produces the following result:

as member: nested: AnyRef{def x: Int; def x_=(x$1: Int): Unit}
as local: nested: Object

(Tested with Scala 2.12.12 and Scala 2.12.3.) Can someone explain why?

object Main extends App {
  def getNestedType(m: Any) = {
    import scala.reflect.runtime.currentMirror
    for {
      symbol <- currentMirror.classSymbol(m.getClass).toType.members
      if symbol.isTerm && !symbol.isMethod && !symbol.isModule
    } yield {
      s"${symbol.name.decodedName}: ${symbol.info}"
    }
  }
  object obj { var
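The snippet is cut off mid-definition. A self-contained reconstruction that would produce output of the reported shape might look like the following; the var nested = new { var x = 1 } bodies are a guess inferred from the printed structural type, not the asker's original code (requires scala-reflect on the classpath):

import scala.reflect.runtime.currentMirror

object Main extends App {
  def getNestedType(m: Any) =
    for {
      symbol <- currentMirror.classSymbol(m.getClass).toType.members
      if symbol.isTerm && !symbol.isMethod && !symbol.isModule
    } yield s"${symbol.name.decodedName}: ${symbol.info}"

  // Guessed member version: the inferred type of `nested` is the
  // structural refinement AnyRef{def x: Int; def x_=(x$1: Int): Unit}.
  object obj { var nested = new { var x = 1 } }

  def asLocal() = {
    // Same definition, but local to a method.
    object obj { var nested = new { var x = 1 } }
    getNestedType(obj)
  }

  println("as member: " + getNestedType(obj).mkString(", "))
  println("as local: " + asLocal().mkString(", "))
}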

How to group a large stream into sub-streams

Submitted by 五迷三道 on 2020-12-04 08:56:55

Question: I want to group a large Stream[F, A] into a Stream[F, Stream[F, A]] with at most n elements per inner stream. This is what I did: basically, I pipe chunks into a Queue[F, Queue[F, Chunk[A]]] and then yield the queue elements as the result stream.

implicit class StreamSyntax[F[_], A](s: Stream[F, A])(
    implicit F: Concurrent[F]) {

  def groupedPipe(
      lastQRef: Ref[F, Queue[F, Option[Chunk[A]]]],
      n: Int): Pipe[F, A, Stream[F, A]] = { in =>
    val initQs = Queue.unbounded[F, Option[Queue[F, Option[Chunk[A]]]]].flatMap {
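The hand-rolled queue machinery is truncated above. For comparison, a much shorter route exists in fs2 itself, sketched below with fs2 2.x / cats-effect 2 names. Note the trade-off: chunkN re-chunks the stream, so each inner "stream" is a fully materialized chunk rather than a lazily fed queue, which may or may not matter for your use case:

import cats.effect.{ExitCode, IO, IOApp}
import fs2.Stream

object GroupedDemo extends IOApp {

  // Group a stream into inner streams of at most n elements each.
  def grouped[F[_], A](s: Stream[F, A], n: Int): Stream[F, Stream[F, A]] =
    s.chunkN(n, allowFewer = true).map(Stream.chunk)

  def run(args: List[String]): IO[ExitCode] =
    grouped(Stream.range(0, 10).covary[IO], 3)
      .evalMap(_.compile.toList)             // drain each inner stream to a List
      .evalMap(group => IO(println(group)))  // List(0, 1, 2), List(3, 4, 5), ...
      .compile.drain
      .map(_ => ExitCode.Success)
}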

[Scala Quick Start] (5) Map and Tuple

Submitted by 情到浓时终转凉″ on 2020-12-04 03:53:51

Contents: 1. Basic Map operations  2. Tuples

1. Basic Map operations

Scala's Map, like Java's, is a key -> value data structure.

// Create an immutable Map
scala> val person = Map("xiaoli" -> 15, "xiaofang" -> 18)
person: scala.collection.immutable.Map[String,Int] = Map(xiaoli -> 15, xiaofang -> 18)

scala> person("xiaoli")
res1: Int = 15

// Next, create a mutable Map
scala> val person = scala.collection.mutable.Map("xiaoli" -> 15, "xiaofang" -> 18)
person: scala.collection.mutable.Map[String,Int] = Map(xiaoli -> 15, xiaofang -> 18)

scala> person("xiaofang")
res3: Int = 18

// A Map can also be created like this
scala> val person = Map(("xiaoxiao", 22), ("xiaowang", 30))
person: scala.collection.immutable.Map[String,Int
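The Tuple half of the excerpt is cut off. As a stopgap, here is a minimal REPL sketch of the Tuple basics the table of contents promises (illustrative values, not the original article's):

// A tuple packs a fixed number of values of possibly different types
scala> val pair = ("xiaoli", 15)
pair: (String, Int) = (xiaoli,15)

// Elements are accessed positionally, starting from _1
scala> pair._1
res0: String = xiaoli

scala> pair._2
res1: Int = 15

// Or destructure in one step
scala> val (name, age) = pair
name: String = xiaoli
age: Int = 15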

[Spark Series 3] Spark 3.0.1 AQE (Adaptive Query Execution) Analysis

Submitted by 不羁岁月 on 2020-12-01 19:43:58

AQE overview

According to the Spark configuration docs, AQE existed as early as Spark 1.6. In the Spark 2.x era, Intel's big-data team built a prototype and put it into practice, and in the Spark 3.0 era, Databricks and Intel together contributed the new AQE to the community.

AQE configuration in Spark 3.0.1

- spark.sql.adaptive.enabled (default: false) — whether adaptive query execution is enabled. Analysis: set it to true here to turn AQE on.
- spark.sql.adaptive.coalescePartitions.enabled (default: true) — whether to coalesce contiguous shuffle partitions (merged according to the spark.sql.adaptive.advisoryPartitionSizeInBytes threshold). Analysis: on by default; see Analysis 1.
- spark.sql.adaptive.coalescePartitions.initialPartitionNum (default: none) — the initial number of shuffle partitions before coalescing; defaults to the value of spark.sql.shuffle.partitions. Analysis: see Analysis 2.
- spark.sql.adaptive.coalescePartitions.minPartitionNum (default: none) — the minimum number of shuffle partitions after coalescing; defaults to the default parallelism of the Spark cluster
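To make the table concrete, here is a minimal sketch of setting these options when building a session; the size and partition numbers are illustrative values, not recommendations:

import org.apache.spark.sql.SparkSession

// Enable AQE and shuffle-partition coalescing on a Spark 3.0.1 session.
val spark = SparkSession.builder()
  .appName("aqe-demo")
  .master("local[*]")  // illustrative; use your cluster master
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Advisory target size used when coalescing shuffle partitions.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
  // Start from the usual shuffle partition count; never coalesce below 8.
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")
  .getOrCreate()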