I just used Standard Scaler to normalize my features for a ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors are arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary amount of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack to column names when creating the dataframe, but still am having trouble converting a vector to a sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
One possible approach is something similar to this
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[org.apache.spark.mllib.linalg.Vector](0).size
// Simple helper to convert vector to array<double>
val vecToSeq = udf((v: Vector) => v.toArray)
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
For Python equivalent see How to split Vector into columns - using PySpark.
Alternate solution that evovled couple of days ago: Import the VectorDisassembler
into your project (as long as it's not merged into Spark), now:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
.setInputCols(Array("val1", "val2"))
.setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
.setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/
Please try VectorSlicer :
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
.setInputCols(Array("negative_logit", "positive_logit"))
.setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/
来源:https://stackoverflow.com/questions/38110038/spark-scala-how-to-convert-dataframevector-to-dataframef1double-fn-d