Multiplying two columns in a pyspark dataframe. One of them contains a vector and one of them contains a constant

Submitted by 痞子三分冷 on 2020-01-14 07:01:32

Question


I have a pyspark dataframe with one column containing vector values and one column containing constant numerical values. Say, for example:

A | B
1 | [2,4,5]
5 | [6,5,3] 

I want to multiply the vector column by the constant column. I'm doing this because I have word embeddings in column B and weights in column A, and my final goal is to compute weighted embeddings.
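Concretely, the desired result is each row's embedding scaled elementwise by that row's weight. A plain-Python sketch of the expected output for the example data (no Spark involved, just the arithmetic):

```python
# Each row pairs a scalar weight (A) with an embedding vector (B);
# the weighted embedding is the elementwise product A * B.
rows = [(1, [2, 4, 5]), (5, [6, 5, 3])]
weighted = [[a * x for x in b] for a, b in rows]
print(weighted)  # [[2, 4, 5], [30, 25, 15]]
```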


Answer 1:


If your vector data is stored as an array of doubles, you can do this:

import breeze.linalg.{Vector => BV}

val data = spark.createDataset(Seq(
    (1, Array[Double](2, 4, 5)),
    (5, Array[Double](6, 5, 3))
  )).toDF("A", "B")

data.as[(Long, Array[Double])].map(r => {
  // Scale each array elementwise by the row's constant using a Breeze vector.
  (BV(r._2) * r._1.toDouble).toArray
}).show()

Which produces:

+------------------+
|             value|
+------------------+
|   [2.0, 4.0, 5.0]|
|[30.0, 25.0, 15.0]|
+------------------+



Answer 2:


From Spark 2.4 onwards, you can use the higher-order functions available in Spark SQL.

scala> val df = Seq((1,Seq(2,4,5)),(5,Seq(6,5,3))).toDF("a","b")
df: org.apache.spark.sql.DataFrame = [a: int, b: array<int>]

scala> df.createOrReplaceTempView("ashima")

scala> spark.sql(""" select a, b, transform(b, x -> x * a) as result from ashima """).show(false)
+---+---------+------------+
|a  |b        |result      |
+---+---------+------------+
|1  |[2, 4, 5]|[2, 4, 5]   |
|5  |[6, 5, 3]|[30, 25, 15]|
+---+---------+------------+




Source: https://stackoverflow.com/questions/54954584/multiplying-two-columns-in-a-pyspark-dataframe-one-of-them-contains-a-vector-an
