Write a Spark Dataset to JSON with all keys in the schema, including null columns

Submitted by 你说的曾经没有我的故事 on 2020-02-02 04:06:06

Question


I am writing a dataset to json using:

ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources")

For records that have columns with null values, the JSON document omits those keys entirely: for example, a row with id = 1 and a null name is written as {"id":1} rather than {"id":1,"name":null}.

Is there a way to force null-valued keys to be written to the JSON output?

This is needed because I read this JSON back into another Dataset in a test case, and I cannot enforce a schema if some documents are missing keys from the case class. (I read the JSON by placing the file under the resources folder and transforming it into a Dataset via RDD[String], as explained here: https://databaseline.bitbucket.io/a-quickie-on-reading-json-resource-files-in-apache-spark/.) A minimal sketch of that reading approach follows.
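For context, this is roughly what the resource-file reading approach looks like. It is a hedged sketch, not the original poster's code: the Record case class, the /test-data.json resource name, and the SparkSession setup are assumptions made for illustration.

import org.apache.spark.sql.SparkSession
import scala.io.Source

// Hypothetical case class standing in for the real test schema.
case class Record(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").appName("json-resource-test").getOrCreate()
import spark.implicits._

// Load the JSON resource from the classpath as raw lines (one JSON document per line).
val lines = Source.fromInputStream(getClass.getResourceAsStream("/test-data.json")).getLines().toSeq

// Parallelize the lines, parse them as JSON, and map onto the case class.
// The .as[Record] step fails when the inferred schema is missing keys of Record.
val ds = spark.read.json(spark.sparkContext.parallelize(lines).toDS()).as[Record]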


Answer 1


I agree with @philantrovert.

ds.na.fill("")
  .coalesce(1)
  .write
  .format("json")
  .save("project/src/test/resources")

Since Datasets are immutable, this does not alter the data in ds; you can continue to process it (null values and all) in any subsequent code. In the saved file you are simply replacing nulls in string columns with an empty string, which forces those keys to appear in the JSON.
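As a usage illustration of the round trip, here is a hedged sketch; the Record case class, SparkSession, and the output subdirectory are assumptions carried over from the sketch in the question, not from the original answer:

// Write: nulls in string columns become "", so every key of the schema
// appears in each JSON document. Note that fill("") only applies to
// string columns; null numeric columns would still be omitted.
ds.na.fill("")
  .coalesce(1)
  .write
  .format("json")
  .save("project/src/test/resources/output")

// Read back in the test: every document now carries all keys,
// so the case class encoder resolves cleanly.
val roundTrip = spark.read.json("project/src/test/resources/output").as[Record]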



Source: https://stackoverflow.com/questions/45235593/write-a-spark-dataset-to-json-with-all-keys-in-the-schema-including-null-column
