I have this DataFrame in Apache Spark:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

How do I split the array in Col2 so that each element ends up in its own column (Col1, Col2, Col3, Col4)?
You can use a map:
import scala.collection.mutable
import org.apache.spark.sql.Row

df.map {
  case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
Just to add to sgvd's solution below:
If the array size is not always the same, you can compute nElements like this:
import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)
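For example (a sketch, assuming spark.implicits._ is in scope and reusing the column-naming scheme from sgvd's answer below), you can feed that nElements straight into the Range-based select; positions missing from shorter arrays typically come back as null:

df.select(($"Col1" +: (0 until nElements).map(idx => $"Col2"(idx).as(s"Col${idx + 2}"))): _*).show()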
A solution that doesn't convert to and from RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")
Or, arguably nicer:
val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as s"Col${idx + 2}")): _*)
The size of a Spark array column is not fixed; you could, for instance, have:
+----+------------+
|Col1| Col2|
+----+------------+
| 1| [2, 3, 4]|
| 1|[2, 3, 4, 5]|
+----+------------+
So there is no way to read the number of columns from the schema and create them automatically. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size
Just to give the PySpark version of sgvd's answer: if the array column is Col2, then this select statement will move the first nElements of each array in Col2 into their own columns:
from pyspark.sql import functions as F

# each getItem(i) becomes its own column, named Col2[0], Col2[1], ... by default
df.select([F.col('Col2').getItem(i) for i in range(nElements)])