PySpark - Add a new nested column or change the value of existing nested columns

走远了吗 · Submitted on 2019-12-30 04:46:07

Question


Suppose I have a JSON file whose lines have the following structure:

{
 "a": 1,
 "b": {
       "bb1": 1,
       "bb2": 2
      }
}

I want to change the value of the key bb1 or add a new key, like bb3. Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key's value or add a nested key, and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame. The workflow is as follows:

def map_func(row):
  dictionary = row.asDict(True)
  # add a new key or change an existing key's value here
  return as_row(dictionary)  # as_row converts the dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")

This works for me. But I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame will kill the efficiency. Is there any other method that could meet this demand without sacrificing efficiency?


The final solution I used relies on withColumn and dynamically building the schema of b. First, we can get b_schema from the DataFrame's schema:

b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
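For the sample document above, spark.read.json infers bb1 and bb2 as long, so b_schema is a Python dict roughly of this shape:

{"type": "struct",
 "fields": [{"metadata": {}, "name": "bb1", "nullable": True, "type": "long"},
            {"metadata": {}, "name": "bb2", "nullable": True, "type": "long"}]}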

After that, b_schema is a dict, and we can add a new field to it:

b_schema['fields'].append({"metadata": {}, "type": "long", "name": "bb3", "nullable": True})

Then we can convert it back to a StructType:

from pyspark.sql.types import StructType

new_b = StructType.fromJson(b_schema)
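As a quick sanity check (fieldNames is available on StructType in PySpark 2.3+; the expected output assumes the sample document above):

print(new_b.fieldNames())  # ['bb1', 'bb2', 'bb3']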

In map_func, we convert the Row to a dict and populate the new field:

def map_func(row):
  data = row.asDict(True)  # recursively convert the nested Row (column b) to a dict
  data['bb3'] = data['bb1'] + data['bb2']  # populate the new field
  return data  # the dict is mapped back through the UDF's return schema

from pyspark.sql.functions import udf

map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
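Putting the steps together, a minimal end-to-end sketch of this approach (assuming an existing SparkSession spark and the sample JSON above stored in json_file):

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

df = spark.read.json("json_file")

# Extract the schema of the nested column b and extend it with bb3.
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
b_schema['fields'].append({"metadata": {}, "type": "long", "name": "bb3", "nullable": True})
new_b = StructType.fromJson(b_schema)

# Rewrite the nested struct with a UDF instead of a DF -> RDD -> DF round trip.
def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data

df.withColumn('b', udf(map_func, new_b)('b')).write.json("new_json_file")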

Thanks @Mariusz


Answer 1:


You can use map_func as a UDF and thus omit converting DF -> RDD -> DF, while still having the flexibility of Python to implement the business logic. All you need is to create a schema object:

>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])

Then you define map_func and the UDF:

>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
... 
>>> map_udf = udf(map_func, new_b)

Finally, apply the UDF to the DataFrame:

>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
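To persist the result as in the question, the same expression works with the DataFrame writer, e.g.:

>>> df.withColumn('b', map_udf('b')).write.json('new_json_file')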

EDIT:

According to the comment: you can add a field to an existing StructType in an easier way, for example:

>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
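A minimal sketch combining this shortcut with a UDF that keeps the existing values and adds bb3 (add_bb3 is a hypothetical helper; the expected output assumes the sample.json structure above):

>>> def add_bb3(b):
...     d = b.asDict()  # convert the nested Row to a dict
...     d['bb3'] = d['bb1'] + d['bb2']
...     return d
... 
>>> df.withColumn('b', udf(add_bb3, new_b)('b')).first()
Row(a=1, b=Row(bb1=1, bb2=2, bb3=3))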


Source: https://stackoverflow.com/questions/42203483/pyspark-add-a-new-nested-column-or-change-the-value-of-existing-nested-columns
