I'm getting the following error trying to build an ML Pipeline:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column feature
You can use a UDF:
udf(lambda vs: Vectors.dense(vs), VectorUDT())
In Spark < 2.0 import:
from pyspark.mllib.linalg import Vectors, VectorUDT
In Spark 2.0+ import:
from pyspark.ml.linalg import Vectors, VectorUDT
Please note that these classes are not compatible, despite having identical implementations.
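For example, here is a minimal sketch assuming Spark 2.0+ and a DataFrame df with an array-of-doubles column named features (the column and variable names are illustrative):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# Convert each array of doubles into a DenseVector
list_to_vector_udf = udf(lambda vs: Vectors.dense(vs), VectorUDT())

df_with_vectors = df.withColumn("features_vector", list_to_vector_udf("features"))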
It is also possible to extract the individual features and assemble them with VectorAssembler. Assuming the input column is called features:
from pyspark.ml.feature import VectorAssembler
n = ... # Size of features
assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)],
    outputCol="features_vector")

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n))
))
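This works because df["features"].getItem(i) produces columns named features[0], features[1], and so on, which match the names passed to inputCols; VectorAssembler then combines them into a single vector column features_vector.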