Nullability in Spark SQL schemas is advisory by default. What is the best way to strictly enforce it?

Question


I am working on a simple ETL project which reads CSV files, performs some modifications on each column, then writes the result out as JSON. I would like downstream processes which read my results to be confident that my output conforms to an agreed schema, but my problem is that even if I define my input schema with nullable=false for all fields, nulls can sneak in and corrupt my output files, and there seems to be no (performant) way I can make Spark enforce 'not null' for my input fields.

This seems to be a deliberate feature, as the following passage from Spark, The Definitive Guide states:

when you define a schema where all columns are declared to not have null values, Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column. If you have null values in columns that should not have null values, you can get an incorrect result or see strange exceptions that can be hard to debug.

I have written a little check utility that goes through each row of a dataframe and raises an error if nulls are detected in any of the columns, at any level of nesting (for nested fields or subfields such as map, struct, or array).

I am wondering, specifically: did I re-invent the wheel with this check utility? Are there any existing libraries or Spark techniques that would do this for me (ideally in a better way than what I implemented)?

The check utility and a simplified version of my pipeline appear below. As presented, the call to the check utility is commented out. If you run without the check utility enabled, you will see this result in /tmp/output.csv:

cat /tmp/output.csv/*
(one + 1),(two + 1)
3,4
"",5

On the second line after the header, the first value should be a number, but it is an empty string (which is how Spark writes out the null, I guess). This output would be problematic for downstream components that read my ETL job's output: those components just want integers.

Now, I can enable the check by un-commenting the line

   //checkNulls(inDf)

When I do this, I get an exception that reports the invalid null value and prints out the entire offending row, like this:

        java.lang.RuntimeException: found null column value in row: [null,4]

One Possible Alternate Approach Given in Spark, The Definitive Guide

Spark, The Definitive Guide mentions the possibility of doing this:

<dataframe>.na.drop() 

But this would (AFAIK) silently drop the bad records rather than flagging them. I could then do a "set subtract" on the input before and after the drop, but that seems like a heavy performance hit just to find out which rows contain nulls. At first glance, I'd prefer my method... but I am still wondering if there might be some better way out there. A rough sketch of the set-subtract idea appears just below, followed by the complete code. Thanks!
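(The following is only a sketch of that set-subtract idea, not code from the book: it reuses the inDf that the full pipeline below defines, and the extra except/count passes are exactly the performance cost I am worried about.)

val cleaned = inDf.na.drop()        // rows with no nulls in any column
val badRows = inDf.except(cleaned)  // rows that na.drop() would silently discard
val badCount = badRows.count()
if (badCount > 0) {
  badRows.show(truncate = false)    // surface the offending rows instead of losing them
  throw new RuntimeException(s"found $badCount row(s) containing null values")
}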

package org

import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// before running, do: rm -rf /tmp/out* /tmp/foo*
object SchemaCheckFailsToExcludeInvalidNullValue extends App {

  import NullCheckMethods._

  //val input = "2,3\n\"xxx\",4"          // this will be dropped as malformed
  val input = "2,3\n,4"                   // BUT.. this will be let through

  new PrintWriter("/tmp/foo.csv") { write(input); close() } // write the test input file

  lazy val sparkConf = new SparkConf()
    .setAppName("Learn Spark")
    .setMaster("local[*]")
  lazy val sparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  val spark = sparkSession

  val schema = new StructType(
    Array(
      StructField("one", IntegerType, nullable = false),
      StructField("two", IntegerType, nullable = false)
    )
  )

  val inDf: DataFrame =
    spark.
      read.
      option("header", "false").
      option("mode", "dropMalformed").
      schema(schema).
      csv("/tmp/foo.csv")

  //checkNulls(inDf)

  val plusOneDf = inDf.selectExpr("one+1", "two+1")
  plusOneDf.show()

  plusOneDf.
    write.
    option("header", "true").
    csv("/tmp/output.csv")

}

object NullCheckMethods extends Serializable {

  // Recursively check a single column value, descending into nested
  // collections and structs; throws as soon as a null is found.
  def checkNull(columnValue: Any): Unit = {
    if (columnValue == null)
      throw new RuntimeException("got null")
    columnValue match {
      case item: Seq[_] =>
        item.foreach(checkNull)
      case item: Map[_, _] =>
        item.values.foreach(checkNull)
      case item: Row =>
        item.toSeq.foreach(checkNull)
      case _ =>
      // non-null leaf value (e.g. Int or String): nothing further to check
    }
  }

  // Check every column of a row; wrap any failure with the offending row
  // so the error message shows which record was bad.
  def checkNulls(row: Row): Unit = {
    try {
      row.toSeq.foreach(checkNull)
    } catch {
      case err: Throwable =>
        throw new RuntimeException(
          s"found null column value in row: $row", err)
    }
  }


  def checkNulls(df: DataFrame): Unit = {
    df.foreach { row => checkNulls(row) }
  }
}

Answer 1:


You can use the built-in Row method anyNull to split the dataframe and process both splits differently:

val plusOneNoNulls = plusOneDf.filter(!_.anyNull)
val plusOneWithNulls = plusOneDf.filter(_.anyNull)
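
To then process the two halves differently (my own sketch, not part of the original answer; the output path is hypothetical), you could write out the clean rows and flag, or fail on, the bad ones:

plusOneNoNulls.
  write.
  option("header", "true").
  csv("/tmp/good_output.csv")              // hypothetical path for the clean rows

if (plusOneWithNulls.count() > 0) {
  plusOneWithNulls.show(truncate = false)  // flag the offending rows rather than dropping them silently
  // throw new RuntimeException("null values found")  // or fail the job outright
}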

If you don't plan to have a manual null-handling process, using the built-in DataFrame.na methods is simpler, since they already implement all the usual ways to handle nulls automatically (i.e., drop them or fill them with default values).
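
For example, a minimal illustration against the plusOneDf from the question (a sketch, not a complete solution):

val dropped = plusOneDf.na.drop()   // remove any row that contains a null
val filled  = plusOneDf.na.fill(0)  // or replace nulls in the numeric columns with a default value, e.g. 0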



Source: https://stackoverflow.com/questions/56124274/nullability-in-spark-sql-schemas-is-advisory-by-default-what-is-best-way-to-str
