Question
This is the structure of my dataframe, according to df.columns:
['LastName',
'FirstName',
'Stud. ID',
'10 Relations',
'Related to Politics',
'3NF',
'Documentation & Scripts',
'SQL',
'Data (CSV, etc.)',
'20 Relations',
'Google News',
'Cheated',
'Sum',
'Delay Factor',
'Grade (out of 2)']
I have transformed this dataframe in PySpark using

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['10 Relations',
                                       'Related to Politics',
                                       '3NF'],
                            outputCol='features')
output = assembler.transform(df)

The result now contains Row objects with the following schema (this is what I get when I run output.printSchema()):
root
|-- LastName: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- Stud. ID: integer (nullable = true)
|-- 10 Relations: integer (nullable = true)
|-- Related to Politics: integer (nullable = true)
|-- 3NF: integer (nullable = true)
|-- Documentation & Scripts: integer (nullable = true)
|-- SQL: integer (nullable = true)
|-- Data (CSV, etc.): integer (nullable = true)
|-- 20 Relations: integer (nullable = true)
|-- Google News: integer (nullable = true)
|-- Cheated: integer (nullable = true)
|-- Sum: integer (nullable = true)
|-- Delay Factor: double (nullable = true)
|-- Grade (out of 2): double (nullable = true)
|-- features: vector (nullable = true)
For each row, the assembler chooses whether to store the features vector as sparse or dense (to save memory). This is a big problem for me, because I want to use the transformed data to fit a linear regression model. So I'm looking for a way to make VectorAssembler always produce a DenseVector.
Any idea?
Note: I have read this post. But the problem is that since the Row class is a subclass of tuple, I cannot change a Row object after it is created.
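The immutability issue mentioned above is just ordinary tuple immutability. As a minimal pure-Python sketch (a plain tuple stand-in, no pyspark required) of why an existing Row cannot be edited in place:

```python
# Row subclasses tuple, so its fields cannot be reassigned after creation.
# A plain tuple (used here as a stand-in for a Row) shows the same behavior.
row = ("Smith", "Jane", 42)

try:
    row[2] = 99  # attempt an in-place modification
except TypeError as e:
    print("cannot modify:", e)

# The only option is to build a *new* tuple with the changed value.
new_row = row[:2] + (99,)
print(new_row)  # ('Smith', 'Jane', 99)
```

This is why the answer below rewrites the column with a UDF instead of mutating rows.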
Answer 1:
Both SparseVector and DenseVector inherit from pyspark.ml.linalg.Vector, so the two vector types share a .toArray() method. You can convert either one to a numpy array and then to a DenseVector with a simple UDF.
from pyspark.ml.linalg import SparseVector, Vectors, VectorUDT
from pyspark.sql import functions as F

v = Vectors.dense([1, 3])  # dense vector
u = SparseVector(2, {})    # sparse vector (all zeros)

# toDense converts either vector type into a DenseVector
toDense = lambda v: Vectors.dense(v.toArray())
toDense(u), toDense(v)
Results:
DenseVector([0.0, 0.0]), DenseVector([1.0, 3.0])
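Conceptually, the conversion above just fills in the zeros a sparse vector leaves implicit. A plain-Python sketch of the same idea (hypothetical helper name, no Spark needed), mirroring what Vectors.dense(v.toArray()) does:

```python
def sparse_to_dense(size, values):
    """Expand a {index: value} sparse representation into a full list,
    filling the positions not present in the map with 0.0."""
    dense = [0.0] * size
    for i, x in values.items():
        dense[i] = x
    return dense

print(sparse_to_dense(2, {}))        # [0.0, 0.0]
print(sparse_to_dense(4, {1: 3.0}))  # [0.0, 3.0, 0.0, 0.0]
```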
Then you can create a UDF from this function and apply it to the column:
df = sqlContext.createDataFrame([
    (v,),
    (u,)
], ['feature'])

toDenseUdf = F.udf(toDense, VectorUDT())
df.withColumn('feature', toDenseUdf('feature')).show()
Results:
+---------+
| feature|
+---------+
|[1.0,3.0]|
|[0.0,0.0]|
+---------+
Now the column contains a single vector type.
Source: https://stackoverflow.com/questions/51317473/make-vectorassembler-always-choose-densevector