convert Dense Vector to Sparse Vector in PySpark

旧巷老猫 submitted on 2020-02-27 12:42:01

Question


Is there a built-in way to create a sparse vector from a dense vector in PySpark? This is how I am doing it at the moment:

Vectors.sparse(len(denseVector), [(i, j) for i, j in enumerate(denseVector) if j != 0])

That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
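For comparison, the size, indices, and values that a sparse vector needs can be pulled out of a dense sequence in plain Python. This is only a sketch: `to_sparse_parts` is a hypothetical helper name, not part of PySpark, and it assumes the input is a flat sequence of numbers.

```python
def to_sparse_parts(dense):
    """Return (size, indices, values) for the non-zero entries of a flat sequence."""
    indices = []
    values = []
    for i, v in enumerate(dense):
        if v != 0:  # keep only the non-zero entries
            indices.append(i)
            values.append(float(v))
    return len(dense), indices, values


size, indices, values = to_sparse_parts([0.0, 3.0, 0.0, 1.5])
print(size, indices, values)  # prints: 4 [1, 3] [3.0, 1.5]
```

The three parts can then be handed to `Vectors.sparse(size, indices, values)`, which accepts two sorted lists of indices and values as well as a list of (index, value) pairs.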


Answer 1:


import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col

If you have just one dense vector this will do it:

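def dense_to_sparse(vector):
    # csc_matrix treats the 1-D array as a single row, so transpose it into a column.
    # Note that _convert_to_vector is a private helper and may change between releases.
    return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)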

dense_to_sparse(densevector)

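The trick here is that _convert_to_vector only accepts a SciPy sparse matrix whose shape[1] equals 1 (a single column), while csc_matrix turns a 1-D array into a single row, so the matrix has to be transposed. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html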
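The shape requirement can be checked with SciPy alone, without a Spark session. The sketch below uses made-up example values and assumes NumPy and SciPy are installed. It only illustrates the shapes involved; the last lines read out the row indices and data of the single column, which, as far as I can tell from the linked source, is what _convert_to_vector uses to build the sparse vector.

```python
import numpy as np
import scipy.sparse

dense = np.array([0.0, 3.0, 0.0, 1.5])  # example values, not from the original post

row = scipy.sparse.csc_matrix(dense)  # a 1-D array becomes a single-row matrix
col = row.T                           # transpose into a single-column matrix

print(row.shape)  # (1, 4)
print(col.shape)  # (4, 1)

# Convert back to CSC and read the non-zero row indices and their values.
csc = col.tocsc()
indices = csc.indices.tolist()
values = csc.data.tolist()
print(indices, values)
```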

The more likely scenario is that you have a DataFrame (DF) with a column of dense vectors:

to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector")))


Source: https://stackoverflow.com/questions/44186939/convert-dense-vector-to-sparse-vector-in-pyspark
