I have data in a Parquet file which has 2 fields: object_id: String and alpha: Map<>.
It is read into a data frame in Spark SQL, and I want to extract the keys of the alpha map (and be able to select them as columns).
If you are using PySpark, here is an easy implementation:
from pyspark.sql.functions import map_keys
alphaDF.select(map_keys("ALPHA").alias("keys")).show()
You can check the details in the map_keys documentation.
Spark >= 2.3
You can simplify the process using the map_keys function:
import org.apache.spark.sql.functions.map_keys
There is also a map_values function, but it won't be directly useful here.
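For example, using the example DataFrame ds with the map column alpha defined in the Spark < 2.3 section below, a minimal sketch of collecting the distinct keys could look like this:

import org.apache.spark.sql.functions.{explode, map_keys}

val distinctKeys = ds
  // map_keys turns each map into an array of its keys; explode gives one key per row
  .select(explode(map_keys($"alpha")))
  .as[String]
  .distinct
  .collect
  .sorted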
Spark < 2.3
The general method can be expressed in a few steps. First, the required imports:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row
and example data:
val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")
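For reference, the resulting schema looks roughly like this (the tuple values become structs with fields _1 and _2):

ds.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- alpha: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: struct (valueContainsNull = true)
//  |    |    |-- _1: integer (nullable = false)
//  |    |    |-- _2: string (nullable = true)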
To extract keys we can use a UDF (Spark < 2.3):
val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)
or built-in functions
import org.apache.spark.sql.functions.map_keys
val keysDF = ds.select(map_keys($"alpha"))
Find distinct ones:
val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted
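With the example data above, this yields Array("bar", "foo").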
You can also generalize keys extraction with explode:
import org.apache.spark.sql.functions.explode
val distinctKeys = ds
  // Flatten the map column into key, value columns
  .select(explode($"alpha"))
  .select($"key")
  .as[String].distinct
  .collect.sorted
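This stays entirely in the DataFrame API, avoids the UDF, and produces the same sorted key array as before.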
And select:
ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)
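With the example data this gives one column per distinct key, holding the corresponding struct (or null where the key is missing). Appending .show would print something like:

+---+------+------+
| id|   bar|   foo|
+---+------+------+
|  1|[2, b]|[1, a]|
|  2|  null|[3, c]|
|  3|[4, d]|  null|
+---+------+------+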