How to change column metadata in PySpark?


How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features and I would like to decode them back in an automated way.
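
For context, here is a minimal sketch of where such metadata comes from (the column names are made up for illustration): StringIndexer stores the label-to-index mapping in the metadata of its output column.

    from pyspark.ml.feature import StringIndexer

    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])
    indexed = StringIndexer(inputCol="category", outputCol="label").fit(df).transform(df)

    indexed.schema["label"].metadata
    ## {'ml_attr': {'vals': ['a', 'b'], 'type': 'nominal', 'name': 'label'}}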

1 Answer

    In both cases losing metadata is expected:

    • When you call a Python udf there is no relationship between the input Column (and its metadata) and the output Column. A UserDefinedFunction (in both Python and Scala) is a black box for the Spark engine (see the sketch after this list).
    • Assigning metadata directly to the Python schema object:

      df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
      

      is not a valid approach at all. A Spark DataFrame is a thin wrapper around a JVM object. Any changes to the Python wrapper are completely opaque to the JVM backend and won't be propagated at all:

      import json 
      
      df = spark.createDataFrame([(1, "foo")], ("k", "v"))
      df.schema[-1].metadata = {"foo": "bar"}
      
      json.loads(df._jdf.schema().json())
      
      ## {'fields': [{'metadata': {}, 'name': 'k', 'nullable': True, 'type': 'long'},
      ##   {'metadata': {}, 'name': 'v', 'nullable': True, 'type': 'string'}],
      ## 'type': 'struct'}
      

      or even preserved in Python:

      df.select("*").schema[-1].metadata
      ## {}
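
    To make the first point concrete, here is a minimal sketch (the sample data and the identity udf are made up for illustration). Metadata attached through the schema is visible on the input column but gone from the udf output:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, StructField, StructType

    # Attach metadata through the schema at creation time.
    schema = StructType([StructField("v", StringType(), True, {"foo": "bar"})])
    df_tagged = spark.createDataFrame([("x",)], schema)
    df_tagged.schema["v"].metadata
    ## {'foo': 'bar'}

    # The udf output is a brand new Column with empty metadata.
    identity = udf(lambda x: x, StringType())
    df_tagged.select(identity("v").alias("v")).schema["v"].metadata
    ## {}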
      

    With Spark < 2.2 you can use a small wrapper (taken from Spark Gotchas, maintained by me and @eliasah):

    import json

    from pyspark import SparkContext
    from pyspark.sql import Column
    from pyspark.sql.functions import col

    def withMeta(self, alias, meta):
        # Build a JVM Metadata object and attach it via Column.as(alias, metadata)
        sc = SparkContext._active_spark_context
        jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
        return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))

    df.withColumn("foo", withMeta(col("foo"), "", {...}))
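
    A quick check (reusing the df from above, with a concrete payload in place of {...}) shows the metadata now reaches the JVM schema:

    df_meta = df.withColumn("v", withMeta(col("v"), "v", {"foo": "bar"}))
    df_meta.schema["v"].metadata
    ## {'foo': 'bar'}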
    

    With Spark >= 2.2 you can use Column.alias:

    df.withColumn("foo", col("foo").alias("", metadata={...}))
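
    Again a quick round trip (the payload is arbitrary) confirms the change is visible on both the Python and JVM sides:

    df2 = df.withColumn("v", col("v").alias("v", metadata={"foo": "bar"}))
    df2.schema["v"].metadata
    ## {'foo': 'bar'}

    json.loads(df2._jdf.schema().json())["fields"][-1]["metadata"]
    ## {'foo': 'bar'}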
    