Make VectorAssembler always choose DenseVector

Posted by 自作多情 on 2019-12-23 06:28:10

Question


This is the structure of my dataframe using df.columns.

['LastName',
 'FirstName',
 'Stud. ID',
 '10 Relations',
 'Related to Politics',
 '3NF',
 'Documentation & Scripts',
 'SQL',
 'Data (CSV, etc.)',
 '20 Relations',
 'Google News',
 'Cheated',
 'Sum',
 'Delay Factor',
 'Grade (out of 2)']

I have transformed this dataframe in pyspark using

assembler = VectorAssembler(
    inputCols=['10 Relations', 'Related to Politics', '3NF'],
    outputCol='features')

and output = assembler.transform(df). The result contains Row objects with the following schema (this is what I get when I run output.printSchema()):

root
 |-- LastName: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- Stud. ID: integer (nullable = true)
 |-- 10 Relations: integer (nullable = true)
 |-- Related to Politics: integer (nullable = true)
 |-- 3NF: integer (nullable = true)
 |-- Documentation & Scripts: integer (nullable = true)
 |-- SQL: integer (nullable = true)
 |-- Data (CSV, etc.): integer (nullable = true)
 |-- 20 Relations: integer (nullable = true)
 |-- Google News: integer (nullable = true)
 |-- Cheated: integer (nullable = true)
 |-- Sum: integer (nullable = true)
 |-- Delay Factor: double (nullable = true)
 |-- Grade (out of 2): double (nullable = true)
 |-- features: vector (nullable = true)

For each row, the assembler chooses whether to make the features vector sparse or dense (for memory reasons). This is a big problem for me, because I want to use the transformed data to fit a linear regression model, so I'm looking for a way to make VectorAssembler always produce a DenseVector.

Any idea?

Note: I have read this post, but the problem is that since the Row class is a subclass of tuple, I cannot modify a Row object after it is created.


Answer 1:


SparseVector and DenseVector both inherit from pyspark.ml.linalg.Vector, so both types share the .toArray() method. You can convert either one to a numpy array and then to a DenseVector with a simple udf.

from pyspark.ml.linalg import DenseVector, SparseVector, Vectors, VectorUDT
from pyspark.sql import functions as F
from pyspark.sql.types import *


v = Vectors.dense([1, 3])  # dense vector
u = SparseVector(2, {})    # sparse vector

# toDense converts either vector type into a DenseVector
toDense = lambda v: Vectors.dense(v.toArray())
toDense(u), toDense(v)

Results:

(DenseVector([0.0, 0.0]), DenseVector([1.0, 3.0]))

Then you can create a udf from this function:

df = sqlContext.createDataFrame([
    (v,),
    (u,)
], ['feature'])

toDense = lambda v: Vectors.dense(v.toArray())
toDenseUdf = F.udf(toDense, VectorUDT())
df.withColumn('feature', toDenseUdf('feature')).show()

Result:

+---------+
|  feature|
+---------+
|[1.0,3.0]|
|[0.0,0.0]|
+---------+

Now the column contains a single vector type.



Source: https://stackoverflow.com/questions/51317473/make-vectorassembler-always-choose-densevector
