How to add sparse vectors after group by, using Spark SQL?

Starting from indexed, we can collect the column newsIndex as a list and transform it into a SparseVector using a UDF.

To declare a sparse vector, we need the number of features and a list of tuples containing the position and the value. Because we are dealing with a categorical variable, the value will always be 1.0, and the position will come from the column newsIndex.
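As a quick standalone illustration of that constructor (a toy call, not part of this answer's data):

from pyspark.ml.linalg import Vectors

Vectors.sparse(4, [(1, 1.0), (3, 1.0)])  # -> SparseVector(4, {1: 1.0, 3: 1.0}), i.e. (4,[1,3],[1.0,1.0])

With that shape in mind, the UDF: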

from pyspark.sql.functions import udf, collect_list, max, lit
from pyspark.ml.linalg import Vectors, VectorUDT

def encode(arr, length):
    # every collected index becomes a (position, 1.0) pair of the sparse vector
    vec_args = length, [(x, 1.0) for x in arr]
    return Vectors.sparse(*vec_args)

encode_udf = udf(encode, VectorUDT())

The number of features is max(newsIndex) + 1 (since StringIndexer begins at 0.0):

feats = indexed.agg(max(indexed["newsIndex"])).take(1)[0][0] + 1
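For the sample data shown below, the largest index is 23.0, so this evaluates to 24, which matches the leading 24 in every vector of the output. Broken down (a sketch; the intermediate names are only for illustration):

row = indexed.agg(max(indexed["newsIndex"])).take(1)[0]   # a single Row holding the max index, here 23.0
feats = row[0] + 1                                        # 23.0 + 1 -> 24.0; Vectors.sparse casts the size to int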

Bringing it all together:

indexed.groupBy("uuid") \
       .agg(collect_list("newsIndex")
       .alias("newsArr")) \
       .select("uuid", 
               encode_udf("newsArr", lit(feats))
               .alias("OHE")) \
       .show(truncate = False)
+---------------+-----------------------------------------+
|uuid           |OHE                                      |
+---------------+-----------------------------------------+
|009092130698762|(24,[0],[1.0])                           |
|010003000431538|(24,[0,3,15],[1.0,1.0,1.0])              |
|010720006581483|(24,[11],[1.0])                          |
|010216216021063|(24,[10,22],[1.0,1.0])                   |
|001436800277225|(24,[2,12,23],[1.0,1.0,1.0])             |
|011425002581540|(24,[1,5,9],[1.0,1.0,1.0])               |
|010156461231357|(24,[13,18],[1.0,1.0])                   |
|011199797794333|(24,[7,8,17,19,20],[1.0,1.0,1.0,1.0,1.0])|
|011414545455156|(24,[4,6,14,21],[1.0,1.0,1.0,1.0])       |
|011337201765123|(24,[1,16],[1.0,1.0])                    |
+---------------+-----------------------------------------+
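Each entry reads as (size, [positions], [values]): for example, (24,[0,3,15],[1.0,1.0,1.0]) is a length-24 vector with 1.0 at positions 0, 3 and 15. If a dense representation is ever needed, a minimal sketch (purely illustrative, not part of the pipeline above):

Vectors.sparse(24, [0, 3, 15], [1.0, 1.0, 1.0]).toArray()   # numpy array of 24 entries, 1.0 at indices 0, 3 and 15, 0.0 elsewhere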