How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features and I would like to decode them back.
In both cases losing the metadata is expected:

- udf: there is no relationship between the input Column (and its metadata) and the output Column. UserDefinedFunctions (both in Python and Scala) are black boxes for the Spark engine.
- Assigning data directly to the Python schema object:

  df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)

  is not a valid approach at all. A Spark DataFrame is a thin wrapper around a JVM object. Any changes made in the Python wrappers are completely opaque to the JVM backend and won't be propagated at all:
import json
df = spark.createDataFrame([(1, "foo")], ("k", "v"))
df.schema[-1].metadata = {"foo": "bar"}
json.loads(df._jdf.schema().json())
## {'fields': [{'metadata': {}, 'name': 'k', 'nullable': True, 'type': 'long'},
## {'metadata': {}, 'name': 'v', 'nullable': True, 'type': 'string'}],
## 'type': 'struct'}
or even preserved in Python:
df.select("*").schema[-1].metadata
## {}
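The udf case behaves the same way. A minimal sketch, assuming an active SparkSession named spark (the "encoding" key is just a placeholder of mine): metadata attached through the schema vanishes after a pass through even an identity udf:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Attach metadata through the schema at creation time.
schema = StructType([
    StructField("k", LongType(), True),
    StructField("v", StringType(), True, metadata={"encoding": "nominal"}),
])
df = spark.createDataFrame([(1, "foo")], schema)

df.schema[-1].metadata
## {'encoding': 'nominal'}

# Even an identity udf discards the input column's metadata.
identity = udf(lambda x: x, StringType())
df.withColumn("v", identity("v")).schema[-1].metadata
## {}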
With Spark < 2.2 you can use a small wrapper (taken from Spark Gotchas, maintained by me and @eliasah):

import json
from pyspark import SparkContext
from pyspark.sql import Column
from pyspark.sql.functions import col

def withMeta(self, alias, meta):
    sc = SparkContext._active_spark_context
    # Build a JVM Metadata object from a JSON string and attach it via alias.
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))

df.withColumn("foo", withMeta(col("foo"), "", {...}))
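A quick round trip to confirm the metadata survives, reusing the toy df from above (the "encoding" key is again just a placeholder):

df = spark.createDataFrame([(1, "foo")], ("k", "v"))
df_meta = df.withColumn("v", withMeta(col("v"), "v", {"encoding": "nominal"}))
df_meta.schema[-1].metadata
## {'encoding': 'nominal'}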
With Spark >= 2.2 you can pass metadata directly to Column.alias:
df.withColumn("foo", col("foo").alias("", metadata={...}))
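And the same check for this variant, assuming the same toy df and a placeholder metadata dict:

df2 = df.withColumn("v", col("v").alias("v", metadata={"encoding": "nominal"}))
df2.schema[-1].metadata
## {'encoding': 'nominal'}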