How can missing columns be added as null when reading nested JSON with PySpark and a predefined struct schema?

Submitted by 梦想与她 on 2021-02-10 15:49:45

Question


Python=3.6

Spark=2.4

My sample JSON data:

{"data":{"header":"someheader","body":{"name":"somename","value":"somevalue","books":[{"name":"somename"},{"value":"somevalue"},{"author":"someauthor"}]}}},
{"data":{"header":"someheader1","body":{"name":"somename1","value":"somevalue1","books":[{"name":"somename1"},{"value":"somevalue1"},{"author":"someauthor1"}]}}},....

My Struct Schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField('header', StringType(), True),
    StructField('body', StructType([
        StructField('name1', StringType(), True),
        StructField('value', StringType(), True),
        StructField('books', ArrayType(StructType([
            StructField('name1', StringType(), True),
            StructField('value', StringType(), True),
            StructField('author', StringType(), True),
            StructField('publisher', StringType(), True)
        ]), True), True)
    ]), True)
])

I want to pass this schema and have all the fields populated, with the ones missing from the data set to NULL.

It may happen that, for a certain day's load, none of the input records has the author column inside the books array of structs.

So if I don't pass a schema, Spark won't be able to infer those columns, since none of the input records contain them.

Here is what I tried:

1> df = spark.read.schema(schema).json('/input/data/path')

This gives me all-null rows, because the input files have header and body nested inside a data field, and data does not exist in the struct schema.
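A minimal sketch of a fix for this particular attempt (my addition, not from the question): wrap the predefined schema in a top-level data struct so it matches the file layout, then unnest. When a schema is supplied, Spark fills fields that are absent from a record with null.

from pyspark.sql.types import StructType, StructField

# Match the actual file layout: {"data": {"header": ..., "body": ...}}
full_schema = StructType([StructField('data', schema, True)])

df = (spark.read
      .schema(full_schema)
      .json('/input/data/path')
      .select('data.*'))   # unnest; fields missing from the data come back as null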

2> df = spark.read.json('/input/data/path').select("data.*")
df.coalesce(1).write.json('/output/path')
df2 = spark.read.schema(schema).json('/output/path')

This also gives me all-null rows, because the struct schema has extra columns that do not exist in the data.
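One way to diagnose all-null rows like these (a debugging sketch, not from the question): in the default PERMISSIVE mode, Spark nulls out every column of a record it cannot fit to the schema, and if the schema contains the configured corrupt-record column, it stores the raw text of that record there.

from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col

# Extend the schema with a corrupt-record column so unparseable rows become visible
debug_schema = schema.add(StructField('_corrupt_record', StringType(), True))
bad = (spark.read
       .schema(debug_schema)
       .option('mode', 'PERMISSIVE')
       .option('columnNameOfCorruptRecord', '_corrupt_record')
       .json('/output/path'))
bad.cache()   # Spark 2.3+ requires this before filtering on the corrupt column alone
bad.filter(col('_corrupt_record').isNotNull()).show(truncate=False)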

3> df = spark.read.json('/input/data/path').select("data.*")
df2 = spark.createDataFrame(df.rdd, schema)

This fails because, for it to work, the ordering of the columns, including all nested ones, must be exactly the same in both the data and the schema, which is not feasible.
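A name-based alternative to this positional approach (my sketch, not from the question): serialize each row back to JSON and parse it with the target schema. from_json matches fields by name and returns null for anything missing, so column order does not matter.

from pyspark.sql.functions import to_json, from_json, struct, col

df = spark.read.json('/input/data/path').select("data.*")
# Round-trip through JSON so fields are matched by name, not position;
# fields present in the schema but absent from the data become null.
df2 = (df
       .select(from_json(to_json(struct(col('*'))), schema).alias('d'))
       .select('d.*'))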

4> In this approach, I tried to read the data without a schema and write it back to a temporary path.

Then I read the data from the input again with the schema, which gives me all-null rows, replace the null values with '1', and write the result into the same temp path in append mode.

Then I read again from this temp path and let Spark infer the schema.

But this also did not work: the null columns inside nested structs are not replaced with non-null values, so when I write the data out, the output path does not have all the columns.

df.coalesce(1).write.json('/output/path')
df_input_with_schema = spark.read.schema(schema).json('/input/data/path')  # all-null rows
df_input_with_schema.na.fill('1').write.format('json').mode('append').save('/output/path')
df_final = spark.read.json('/output/path').filter(col("keycolumn") == '1')
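Part of the reason this approach fails (my note, not from the question): na.fill() only touches top-level columns of atomic types; fields inside a struct or an array of structs are left untouched. Rebuilding one nested field by hand looks roughly like this, using the field names from the schema above:

from pyspark.sql.functions import struct, coalesce, lit, col

# na.fill() ignores nested fields, so the struct has to be rebuilt explicitly
df_filled = df_input_with_schema.withColumn(
    'body',
    struct(
        coalesce(col('body.name1'), lit('1')).alias('name1'),
        coalesce(col('body.value'), lit('1')).alias('value'),
        col('body.books').alias('books')   # array elements would need their own handling
    ))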

Can someone please help?


Answer 1:


Since Spark 3 there is the ignoreNullFields option (True by default).

You can do:

df.coalesce(1).write.mode('overwrite').json(ignoreNullFields=False,path="a")

in order to keep columns that contain only null values.
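Applied to the question's pipeline, a sketch (note that ignoreNullFields requires Spark 3, while the question uses 2.4): read with the predefined schema so missing fields become null, then write without dropping null fields, so that any later schema inference still sees every column.

df = spark.read.schema(schema).json('/input/data/path')
(df.coalesce(1)
   .write
   .mode('overwrite')
   .json('/output/path', ignoreNullFields=False))   # null fields are kept in the output
df_final = spark.read.json('/output/path')          # inference now sees all columns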



Source: https://stackoverflow.com/questions/63870745/how-can-missing-columns-be-added-as-null-while-read-a-nested-json-using-pyspark
