Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
You could also change the schema of a DataFrame quite quickly. Something like this would do the job:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

def setNullableStateForAllColumns(df: DataFrame, columnMap: Map[String, Boolean]): DataFrame = {
  // Copy the existing schema, overriding nullability for any column named in
  // columnMap and keeping the current setting for everything else.
  val newSchema = StructType(df.schema.map {
    case StructField(c, d, n, m) => StructField(c, d, columnMap.getOrElse(c, n), m)
  })
  // Re-create the DataFrame with the adjusted schema.
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
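A minimal usage sketch, assuming the function above is in scope in a spark-shell session (the column names foo and bar here are hypothetical):

val df = Seq((1, "A"), (2, "B")).toDF("foo", "bar")
// "foo" starts as nullable = false (scala.Int); flip it to nullable = true,
// and force "bar" to nullable = false. Columns not in the map keep their state.
val adjusted = setNullableStateForAllColumns(df, Map("foo" -> true, "bar" -> false))
adjusted.printSchema()
// root
//  |-- foo: integer (nullable = true)
//  |-- bar: string (nullable = false)

Be aware that going through df.rdd re-creates the DataFrame from scratch and drops the existing logical plan, so it is best applied once rather than repeatedly inside a transformation chain.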
When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property:
- If the object of a given type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL.
- In any other case the column will be marked as not nullable.

Since Scala String is java.lang.String, which can be null, a generated column can be nullable. For the same reason the bar column is nullable in the initial dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null).
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
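Following the Option[_] rule above, wrapping a field in Option also yields a nullable column, with None surfacing as SQL NULL (a small sketch; data3 is a name chosen here for illustration):

val data3 = Seq[(Option[Int], String)]((Some(2), "A"), (None, "B"))
data3.toDF("foo", "bar").schema("foo").nullable
// Boolean = true -- the None row becomes SQL NULL in foo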
See also: SPARK-20668 Modify ScalaUDF to handle nullability.
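That ticket is directly related to the behavior asked about above: columns produced by Scala UDFs are marked nullable by default, regardless of the data they actually contain. A minimal sketch in the spark-shell, reusing df1 from above (foo_inc is a hypothetical column name):

import org.apache.spark.sql.functions.{col, udf}
val inc = udf((x: Int) => x + 1)
// The UDF output column is nullable even though its input column is not.
df1.withColumn("foo_inc", inc(col("foo"))).schema("foo_inc").nullable
// Boolean = true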