Question
Hi, I want to add a new column to a DataFrame that contains the list of all column names (for that row) whose values are not null. How do I achieve this in Scala? Please help.
val baseDF = Seq(
(3, "California", "name1", 9846, null, "SFO"),
(1, "Oregon", "name2", 9847, null, null),
(2, null, null, null, null, null)
).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
The expected output is a new column named "NonNullColumns" that holds the non-null column names for each row:
NonNullColumns
==============
["emp_id", "emp_city", "emp_name", "emp_phone", "emp_site"]
["emp_id", "emp_city", "emp_name", "emp_phone"]
["emp_id"]
Answer 1:
A slight alternative using withColumn and reduce. Starting from your DataFrame, I made every column a String to avoid Any type issues. The DataFrame is named df, and only the relevant parts of the code are shown:
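import org.apache.spark.sql.functions._
// For each column emit ",<name>" when the value is non-null and a bare "," otherwise, then concatenate the pieces.
val nonNulls = df.columns.map(x => when(col(x).isNotNull, concat(lit(","), lit(x))).otherwise(",")).reduce(concat(_, _))
val df2 = df.withColumn("nonNulls", nonNulls)
// Split on "," and drop the empty strings left behind by null columns and the leading comma.
val df3 = df2.withColumn("nonNullsCols", array_remove(split(col("nonNulls"), ","), lit(""))).drop("nonNulls")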
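Since only fragments are shown above, here is a minimal end-to-end sketch of the same concat/split idea, assuming Spark 2.4 or later (array_remove is not available before that). The object name, the helper names sampleData and withNonNullColumns, and the local SparkSession settings are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_remove, col, concat, lit, split, when}

object NonNullColumnsViaConcat {
  // Hypothetical helper: the question's sample rows, with every field typed as String.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, String, String, String, String, String)](
      ("3", "California", "name1", "9846", null, "SFO"),
      ("1", "Oregon", "name2", "9847", null, null),
      ("2", null, null, null, null, null)
    ).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
  }

  // Hypothetical helper: appends a NonNullColumns array column.
  def withNonNullColumns(df: DataFrame): DataFrame = {
    // ",<name>" when the column is non-null, a bare "," otherwise, joined into one string.
    val marked = df.columns
      .map(c => when(col(c).isNotNull, concat(lit(","), lit(c))).otherwise(lit(",")))
      .reduce((a, b) => concat(a, b))
    // Splitting on "," leaves empty strings for the null columns; array_remove drops them.
    df.withColumn("NonNullColumns", array_remove(split(marked, ","), ""))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("non-null-columns").getOrCreate()
    withNonNullColumns(sampleData(spark)).select("NonNullColumns").show(false)
    spark.stop()
  }
}
```

Note that this technique relies on "," never appearing inside a column name; a name containing a comma would be split into several array entries.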
Answer 2:
I loaded the data from a CSV file, with all fields as strings.
val cols = df.schema.fieldNames.map(s => when(col(s).isNotNull, s).otherwise(""))
df.select(cols: _*).select(array_remove(array('*), "").as("NonNullColumns")).show(false)
Output:
+------+----------+--------+---------+-------+--------+
|emp_id|  emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+----------+--------+---------+-------+--------+
|     3|California|   name1|     9846|   null|     SFO|
|     1|    Oregon|   name2|     9847|   null|    null|
|     2|      null|    null|     null|   null|    null|
+------+----------+--------+---------+-------+--------+

+-------------------------------------------------+
|NonNullColumns                                   |
+-------------------------------------------------+
|[emp_id, emp_city, emp_name, emp_phone, emp_site]|
|[emp_id, emp_city, emp_name, emp_phone]          |
|[emp_id]                                         |
+-------------------------------------------------+
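If converting every column to String is undesirable, one possible variant (not taken from either answer) keeps the original column types: build an array with one slot per column, holding the column name or null, then drop the nulls with the higher-order filter function. This is only a sketch, assuming Spark 3.0 or later, where functions.filter is exposed in the Scala API. The object name, helper names, and session settings are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, filter, lit, when}

object NonNullColumnsViaFilter {
  // Hypothetical helper: the question's rows, with Option[...] marking nullable fields
  // so the numeric columns keep their Int type instead of degrading to Any.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, Option[String], Option[String], Option[Int], Option[Int], Option[String])](
      (3, Some("California"), Some("name1"), Some(9846), None, Some("SFO")),
      (1, Some("Oregon"), Some("name2"), Some(9847), None, None),
      (2, None, None, None, None, None)
    ).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
  }

  // Hypothetical helper: appends a NonNullColumns array column.
  def withNonNullColumns(df: DataFrame): DataFrame = {
    // One array slot per column: the column name when the value is present, null otherwise
    // (when without otherwise evaluates to null).
    val names = array(df.columns.map(c => when(col(c).isNotNull, lit(c))): _*)
    // Keep only the non-null slots.
    df.withColumn("NonNullColumns", filter(names, (name: Column) => name.isNotNull))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("non-null-columns").getOrCreate()
    withNonNullColumns(sampleData(spark)).select("NonNullColumns").show(false)
    spark.stop()
  }
}
```

Because no delimiter string is involved, this variant does not depend on what characters the column names contain.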
Source: https://stackoverflow.com/questions/62127238/add-a-column-to-spark-dataframe-which-contains-list-of-all-column-names-of-the-c