Spark: Create dataframe with default values

Submitted on 2021-01-29 11:22:04

Question


Can we put a default value in a field of a dataframe while creating the dataframe? I am creating a Spark dataframe from List<Object[]> rows as:

import java.util.stream.Collectors;
import org.apache.spark.sql.RowFactory;

List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);

While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains objects of the org.apache.spark.sql.types.Metadata class. The documentation does not specify the exact purpose of the class:

/**
 * Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
 * Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
 * Array[Metadata]. JSON is used for serialization.
 *
 * The default constructor is private. User should use either [[MetadataBuilder]] or
 * `Metadata.fromJson()` to create Metadata instances.
 *
 * @param map an immutable map that stores the data
 *
 * @since 1.3.0
 */

This class supports only a very limited set of datatypes, and there is no out-of-the-box API for using it to insert a default value during dataset creation.

Where does one use this metadata? Can someone share a real-life use case?
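For reference, here is a minimal sketch of how field metadata gets attached, using MetadataBuilder as the javadoc suggests. The "default" key and its value are my own invention; Spark carries metadata along with the schema but attaches no meaning to any particular key:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build metadata for a column; the key "default" is hypothetical, and Spark
// itself never interprets it. It simply travels with the schema.
Metadata nameMeta = new MetadataBuilder().putString("default", "N/A").build();

StructType schema = new StructType(new StructField[]{
    new StructField("name", DataTypes.StringType, true, nameMeta),
    new StructField("age", DataTypes.IntegerType, true, Metadata.empty())
});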

I know we can write our own map function to iterate over rows.stream().map(RowFactory::create) and put in default values, as sketched below. But is there any way we could do this using Spark APIs?
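For completeness, such a map step might look like the following. Here defaultFor() is a hypothetical helper that looks up a per-column default; there is no built-in Spark counterpart:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.StructField;

// Replace nulls with per-column defaults before the rows ever reach Spark.
List<Row> sparkRows = rows.stream()
        .map(values -> {
            Object[] filled = new Object[values.length];
            for (int i = 0; i < values.length; i++) {
                StructField field = schema.fields()[i];
                // defaultFor() is a hypothetical lookup, e.g. backed by a
                // Map<String, Object> keyed on field.name().
                filled[i] = (values[i] != null) ? values[i] : defaultFor(field);
            }
            return RowFactory.create(filled);
        })
        .collect(Collectors.toList());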

Edit: I am expecting something similar to Oracle's DEFAULT functionality: we define a default value for each column, according to its datatype, and while creating the dataframe, if a value is missing or null, the default is used instead.
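The closest built-in behaviour I know of is DataFrameNaFunctions: create the dataframe with nulls and then backfill each column. A sketch, assuming columns named name and age; unlike Oracle's DEFAULT, this is a post-hoc fill rather than part of the schema:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Per-column replacement values for nulls; the column names and defaults
// here are assumptions for the example.
Map<String, Object> defaults = new HashMap<>();
defaults.put("name", "N/A");
defaults.put("age", -1);

// na().fill(...) returns a new Dataset with nulls in the listed columns
// replaced; it does not help with rows that are missing fields entirely.
Dataset<Row> withDefaults = dataset.na().fill(defaults);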

Source: https://stackoverflow.com/questions/57358381/spark-create-dataframe-with-default-values
