Add a column to a Spark DataFrame that contains the list of all column names of the current row whose values are not null

筅森魡賤 posted on 2021-02-11 14:55:34

Question


Hi, I want to add a new column to a DataFrame that contains, for each row, the list of all column names whose values are not null. How do I achieve this in Scala? Please help.

val baseDF = Seq(
(3, "California", "name1", 9846, null, "SFO"),
(1, "Oregon", "name2", 9847, null, null),
(2, null, null, null, null, null)
).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

The expected output is a new column named "NonNullColumns" that lists the non-null column names for each row:

NonNullColumns 
==============
["emp_id", "emp_city", "emp_name", "emp_phone", "emp_site"]
["emp_id", "emp_city", "emp_name", "emp_phone"]
["emp_id"]

Answer 1:


A slight alternative using withColumn and reduce. Starting from your DataFrame, I made all columns String to avoid Any type issues and used df as the DataFrame name. Only the relevant parts of the code are shown:

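import org.apache.spark.sql.functions._
// Emit ",<name>" for each non-null column and a bare "," otherwise, then concatenate per row.
val nonNulls = df.columns.map(x => when(col(x).isNotNull, concat(lit(","), lit(x))).otherwise(",")).reduce(concat(_, _))
val df2 = df.withColumn("nonNulls", nonNulls)
// Split on "," and drop the empty strings left behind by null columns and the leading comma.
val df3 = df2.withColumn("nonNullsCols", array_remove(split(col("nonNulls"),","), lit(""))).drop("nonNulls")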
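
Since the fragment above shows only the relevant parts, here is a minimal end-to-end sketch of the same concat-then-split idea. It assumes Spark 2.4 or later (for array_remove) and a local SparkSession; the object name ConcatSplitSketch and the helper addNonNullCols are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object ConcatSplitSketch {
  // Hypothetical helper wrapping the steps from the answer above.
  def addNonNullCols(df: DataFrame): DataFrame = {
    // One ",<name>" fragment per non-null column, a bare "," otherwise.
    val nonNulls = df.columns
      .map(c => when(col(c).isNotNull, concat(lit(","), lit(c))).otherwise(","))
      .reduce(concat(_, _))

    df.withColumn("nonNulls", nonNulls)
      // Splitting on "," leaves empty strings where a column was null; remove them.
      .withColumn("nonNullsCols", array_remove(split(col("nonNulls"), ","), lit("")))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("concat-split-sketch").getOrCreate()
    import spark.implicits._

    // Every column is typed as String, so null is a legal value in any position.
    val df = Seq[(String, String, String, String, String, String)](
      ("3", "California", "name1", "9846", null, "SFO"),
      ("1", "Oregon", "name2", "9847", null, null),
      ("2", null, null, null, null, null)
    ).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    // Keep the intermediate string visible to see what split() receives.
    addNonNullCols(df).select("emp_id", "nonNulls", "nonNullsCols").show(false)
    spark.stop()
  }
}
```

For the first row the intermediate nonNulls string is ",emp_id,emp_city,emp_name,emp_phone,,emp_site", which is why the empty strings have to be removed after the split. Note that this approach would misbehave if a column name itself contained a comma.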



Answer 2:


I loaded the data from a CSV file, with all fields as strings.

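import org.apache.spark.sql.functions._
// Per column: its own name when the value is non-null, an empty string otherwise.
val cols = df.schema.fieldNames.map(s => when(col(s).isNotNull, s).otherwise(""))
df.select(cols: _*).select(array_remove(array(col("*")), "").as("NonNullColumns")).show(false)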

output:

+------+----------+--------+---------+-------+--------+
|emp_id|  emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+----------+--------+---------+-------+--------+
|     3|California|   name1|     9846|   null|     SFO|
|     1|    Oregon|   name2|     9847|   null|    null|
|     2|      null|    null|     null|   null|    null|
+------+----------+--------+---------+-------+--------+

+-------------------------------------------------+
|NonNullColumns                                   |
+-------------------------------------------------+
|[emp_id, emp_city, emp_name, emp_phone, emp_site]|
|[emp_id, emp_city, emp_name, emp_phone]          |
|[emp_id]                                         |
+-------------------------------------------------+
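
The select above returns only the NonNullColumns column, while the question asks for it to be added alongside the existing ones. Below is a minimal sketch of one way to do that with withColumn, assuming Spark 2.4 or later. The object name NonNullColumnsSketch and the helper withNonNullColumns are hypothetical. Because the markers are always strings, the source columns can keep their own types; Option is used here so that the typed sample data can hold nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, array_remove, col, lit, when}

object NonNullColumnsSketch {
  // Hypothetical helper: appends `outCol`, an array holding the names of the
  // columns whose value is non-null in each row.
  def withNonNullColumns(df: DataFrame, outCol: String = "NonNullColumns"): DataFrame = {
    // Column name when the value is present, empty-string placeholder otherwise.
    val markers = df.columns.map(c => when(col(c).isNotNull, lit(c)).otherwise(lit("")))
    // Pack the markers into one array and strip the placeholders.
    df.withColumn(outCol, array_remove(array(markers: _*), ""))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("non-null-columns").getOrCreate()
    import spark.implicits._

    // Option[...] lets a typed column hold nulls without the tuple falling back to Any.
    val baseDF = Seq[(Int, Option[String], Option[String], Option[Int], Option[Int], Option[String])](
      (3, Some("California"), Some("name1"), Some(9846), None, Some("SFO")),
      (1, Some("Oregon"), Some("name2"), Some(9847), None, None),
      (2, None, None, None, None, None)
    ).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    withNonNullColumns(baseDF).show(false)
    spark.stop()
  }
}
```

One caveat with this sketch: col(c) treats a dot in a column name as nested-field access, so names containing dots would need backtick quoting before being passed to col.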


Source: https://stackoverflow.com/questions/62127238/add-a-column-to-spark-dataframe-which-contains-list-of-all-column-names-of-the-c
