Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show
If printSchema is called on that foo/foo_2 selection at this point, nullable will be false for both columns.
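For reference, the printSchema output at this point should look like:

root
 |-- foo: integer (nullable = false)
 |-- foo_2: integer (nullable = false)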
val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

// fooMap must be defined before foo, or the reference below will not
// compile when run line by line in the REPL.
val foo: Int => String = (t: Int) =>
  fooMap.get(t) match {
    case Some(tt) => tt
    case None     => "notFound"
  }

val fooUDF = udf(foo)
myDf
  .withColumn("foo", fooUDF(col("foo")))
  .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
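This prints something like:

root
 |-- foo: string (nullable = true)
 |-- foo_2: integer (nullable = false)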
However, nullable is now true for at least one column that was false before. How can this be explained?
When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property, as the sketch after this list illustrates:

- If an object of the given type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL.
- In any other case it will be marked as not nullable.
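A minimal sketch of these rules, assuming a spark-shell session (so a SparkSession named spark is in scope); the case class Record is made up for illustration:

import spark.implicits._

case class Record(a: Int, b: String, c: Option[Int])

val ds = Seq(Record(1, "x", Some(2))).toDS()
ds.schema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
// a: nullable = false  -> scala.Int cannot be null
// b: nullable = true   -> String is java.lang.String, which can be null
// c: nullable = true   -> Option[_]; None maps to SQL NULL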
Since Scala String is java.lang.String, which can be null, the generated column is nullable. For the same reason the bar column is nullable in the initial dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null):
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
See also: SPARK-20668 Modify ScalaUDF to handle nullability.
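As a side note, Spark 2.4 and later expose UserDefinedFunction.asNonNullable(), which lets you declare that a UDF never returns null; a sketch using the fooUDF definition from the question, assuming Spark 2.4+:

// fooUDF's underlying function always returns a String, never null,
// so marking the UDF as non-nullable is safe here.
val nonNullFooUDF = udf(foo).asNonNullable()
myDf.withColumn("foo", nonNullFooUDF(col("foo"))).schema("foo").nullable
// Boolean = false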
You can also change the schema of a DataFrame quickly; something like this would do the job:
def setNullableStateForAllColumns(df: DataFrame, columnMap: Map[String, Boolean]): DataFrame = {
  import org.apache.spark.sql.types.{StructField, StructType}
  // Rebuild the schema, overriding nullability where columnMap has an entry
  // and keeping the existing setting otherwise.
  val newSchema = StructType(df.schema.map {
    case StructField(c, d, n, m) => StructField(c, d, columnMap.getOrElse(c, n), m)
  })
  // Apply the new schema by round-tripping through the underlying RDD.
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
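A usage sketch (df2 is just a hypothetical name for the UDF result from the question; forcing foo to non-nullable is safe here because fooUDF never actually returns null):

val df2 = myDf.withColumn("foo", fooUDF(col("foo")))
val fixed = setNullableStateForAllColumns(df2, Map("foo" -> false))
fixed.schema("foo").nullable
// Boolean = false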
Source: https://stackoverflow.com/questions/40603756/why-do-columns-change-to-nullable-in-apache-spark-sql