How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features and I would like to decode them back to the original string values.
In both cases losing metadata is expected:
When you call a Python udf there is no relationship between the input Column (and its metadata) and the output Column. A UserDefinedFunction (both in Python and Scala) is a black box for the Spark engine.

Assigning data directly to the Python schema object:
df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
is not a valid approach at all. A Spark DataFrame is a thin wrapper around a JVM object. Any changes made to the Python wrapper are completely opaque to the JVM backend and won't be propagated at all:
import json
df = spark.createDataFrame([(1, "foo")], ("k", "v"))
df.schema[-1].metadata = {"foo": "bar"}
json.loads(df._jdf.schema().json())
## {'fields': [{'metadata': {}, 'name': 'k', 'nullable': True, 'type': 'long'},
##   {'metadata': {}, 'name': 'v', 'nullable': True, 'type': 'string'}],
##  'type': 'struct'}
They aren't even preserved on the Python side:
df.select("*").schema[-1].metadata
## {}
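The udf case behaves the same way. A minimal sketch (the schema and metadata keys here are illustrative): even an identity udf yields a brand new Column with empty metadata:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Illustrative schema with metadata attached to column "v"
schema = StructType([
    StructField("k", LongType(), True),
    StructField("v", StringType(), True, {"foo": "bar"}),
])
df_meta = spark.createDataFrame([(1, "foo")], schema)
df_meta.schema["v"].metadata
## {'foo': 'bar'}

# The output of even an identity udf is a fresh Column with no metadata
identity = udf(lambda x: x, StringType())
df_meta.select(identity("v").alias("v")).schema["v"].metadata
## {}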
With Spark < 2.2 you can use a small wrapper (taken from Spark Gotchas, maintained by me and @eliasah):
from pyspark import SparkContext
from pyspark.sql.column import Column
from pyspark.sql.functions import col

def withMeta(self, alias, meta):
    # Attach JSON metadata through the JVM-side Column.as(alias, metadata)
    sc = SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))
df.withColumn("foo", withMeta(col("foo"), "", {...}))
With Spark >= 2.2 you can use Column.alias:
df.withColumn("foo", col("foo").alias("", metadata={...}))