Scala Spark - split vector column into separate columns in a Spark DataFrame

爷,独闯天下 提交于 2019-11-29 11:53:42

One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.

import org.apache.spark.sql.functions._
import org.apache.spark.ml._

val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]

df.show
//+---+---------------------+
//|id |features             |
//+---+---------------------+
//|1  |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+

// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )

// Add a ArrayType Column   
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )

// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")

// Create a SQL-like expression using the array 
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }

// Extract Elements from dfArr    
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!