How to iterate over each column in a DataFrame in Spark Scala

Submitted by 你离开我真会死 on 2019-12-08 03:58:19

Question


Suppose I have a dataframe with multiple columns. I want to iterate over each column, do some calculation, and update that column. Is there a good way to do that?
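In plain Scala, the general "iterate over the columns, transform each one, and thread the updated table through" idea can be sketched with `foldLeft` over the column names. This is a Spark-free illustration of the pattern the answers use; the `Table` alias and `mapColumns` helper here are hypothetical stand-ins, not Spark API:

```scala
// Toy column-oriented table: column name -> values (a stand-in for a DataFrame).
type Table = Map[String, Seq[Double]]

// Apply `update` to every column in turn, threading the table through foldLeft,
// mirroring df.columns.foldLeft(df)((acc, c) => acc.withColumn(c, ...)).
def mapColumns(t: Table)(update: Seq[Double] => Seq[Double]): Table =
  t.keys.foldLeft(t)((acc, name) => acc.updated(name, update(acc(name))))

val table: Table = Map("c1" -> Seq(1.0, 2.0, 3.0), "c2" -> Seq(15.0, 20.0, 30.0))

// Example update: divide each value by its column's sum.
val normalized = mapColumns(table) { column =>
  val total = column.sum
  column.map(_ / total)
}
```

The same shape carries over to Spark: the accumulator becomes the DataFrame and the per-column update becomes `withColumn`.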


Answer 1:


Update: In the example below, the dataframe has two integer columns, c1 and c2. For each column, the column's sum is divided by each value (e.g. c1 sums to 6, so the row with value 1 becomes 6.0).

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}

val df = Seq((1, 15), (2, 20), (3, 30)).toDF("c1", "c2")

// Replace each column with sum(column) / value. Ordering the window by the
// constant lit(1) makes every row a peer, so the frame spans the whole column.
val result = df.columns.foldLeft(df) { (acc, colname) =>
  acc.withColumn(colname, sum(acc(colname)).over(Window.orderBy(lit(1))) / acc(colname))
}

Output:

scala> result.show()
+---+------------------+
| c1|                c2|
+---+------------------+
|6.0| 4.333333333333333|
|3.0|              3.25|
|2.0|2.1666666666666665|
+---+------------------+



Answer 2:


@rogue-one has already answered your question; you just need to adapt that answer to your requirements.

The following solution avoids the Window function.

val df = List(
  (2, 28),
  (1, 21),
  (7, 42)
).toDF("col1", "col2")

Your input dataframe should look like this:

+----+----+
|col1|col2|
+----+----+
|2   |28  |
|1   |21  |
|7   |42  |
+----+----+

Now, to compute columnValue / sumOfColumnValues for every column:

import org.apache.spark.sql.functions.{col, sum}

val columnsModify = df.columns.map { colName =>
  // Sum the column first, then divide every value by that total.
  val total = df.select(sum(colName)).first().getLong(0)
  (col(colName) / total).as(colName)
}

df.select(columnsModify: _*).show(false)

You should get output like this:

+----+-------------------+
|col1|col2               |
+----+-------------------+
|0.2 |0.3076923076923077 |
|0.1 |0.23076923076923078|
|0.7 |0.46153846153846156|
+----+-------------------+


Source: https://stackoverflow.com/questions/44730081/how-to-iterate-each-column-in-a-dataframe-in-spark-scala
