Question
We are using Spark to parse a big CSV file, which may contain invalid data. We want to save the valid data into the data store, and also return how many valid records we imported and how many invalid ones we rejected.
I am wondering how we can do this in Spark. What's the standard approach when reading data?
My current approach uses an Accumulator, but it's not accurate due to how accumulators work in Spark: updates made inside a transformation can be applied more than once if a task is retried or a stage is recomputed, so the error count can end up too high.
import java.util.{Date, UUID}
import scala.util.{Success, Try}

// we define case class CSVInputData: all fields are defined as String
val csvInput = spark.read.option("header", "true").csv(csvFile).as[CSVInputData]

val newDS = csvInput
  .flatMap { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(data) => Seq(data)
      case _ =>
        errorAcc.add(1) // may over-count if the task is retried
        Seq()
    }
  }
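An accumulator-free way to get an exact invalid count is to derive it by subtraction from two count actions, which are exact. This is a minimal sketch reusing csvInput and newDS from above; caching avoids re-reading the file:

csvInput.cache()
val totalCount   = csvInput.count()
val validCount   = newDS.count()   // rows that failed parsing became empty Seqs
val invalidCount = totalCount - validCount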
I tried to use Either, but it failed with the exception:
java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with scala.util.Either[xx.CSVInputData,xx.DomainData] found
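One way to stay in the Dataset API is to avoid Either entirely and parse into an encoder-friendly case class. This is only a sketch: ParsedRow is a hypothetical simplification of DomainData. Spark derives encoders for case classes built from primitives and Options, unlike for Either on Scala 2.11:

import scala.util.{Failure, Success, Try}

case class ParsedRow(score: Option[Double], error: Option[String])

val parsed = csvInput.map { row =>
  Try(row.score.trim.toDouble) match {
    case Success(s)  => ParsedRow(Some(s), None)
    case Failure(ex) => ParsedRow(None, Some(ex.getMessage))
  }
}
val invalidCount = parsed.filter(_.error.isDefined).count()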
Update
I think Either doesn't work with the Spark 2.0 Dataset API:
spark.read.option("header", "true").csv("any.csv")
  .map { row =>
    try {
      Right("")
    } catch {
      case e: Throwable => Left("")
    }
  }
If we switch to sc (the RDD API), it works:
sc.parallelize('a' to 'z').map { row =>
  try {
    Right("")
  } catch {
    case e: Throwable => Left("")
  }
}.collect()
In the current latest Scala, 2.11 (http://www.scala-lang.org/api/2.11.x/index.html#scala.util.Either), Either doesn't implement the Serializable trait:
sealed abstract class Either[+A, +B] extends AnyRef
In the upcoming 2.12 (http://www.scala-lang.org/api/2.12.x/scala/util/Either.html), it does:
sealed abstract class Either[+A, +B] extends Product with Serializable
This matches the exception above: the result type of a try/catch returning Right or Left is inferred as the least upper bound of the two branches, which on 2.11 is the refined type Product with Serializable with Either[...], and Spark's reflection can't map that refined type to a Java class.
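A quick REPL check illustrates the inferred type (a sketch; the exact rendering may differ between REPL versions):

// In a Scala 2.11 REPL:
scala> val e = if (true) Right("") else Left("")
e: Product with Serializable with scala.util.Either[String,String] = Right()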
Update 2 with a workaround
More info at Spark ETL: Using Either to handle invalid data
Since the Spark Dataset API doesn't work with Either, the workaround is to call ds.rdd and then use try/Left/Right on the RDD to capture both valid and invalid data.
spark.read.option("header", "true").csv("/Users/yyuan/jyuan/1.csv").rdd
  .map { row =>
    try {
      Right("")
    } catch {
      case e: Throwable => Left("")
    }
  }.collect()
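A fuller sketch of the same workaround applied to the question's domain (assuming the csvInput Dataset, DomainData bean, and Util helper from above). Dropping to the RDD sidesteps encoders entirely, so Either works, and caching lets us count both sides of one parse:

import java.util.{Date, UUID}

val parsed = csvInput.rdd.map { row =>
  try {
    val data = new DomainData()
    data.setScore(row.score.trim.toDouble)
    data.setId(UUID.randomUUID().toString())
    data.setDate(Util.parseDate(row.eventTime.trim))
    data.setUpdateDate(new Date())
    Right(data)
  } catch {
    case e: Throwable => Left(row)
  }
}
parsed.cache() // parse once, count both sides
val invalidCount = parsed.filter(_.isLeft).count()
val validData    = parsed.collect { case Right(d) => d } // still an RDD
val validCount   = validData.count()
// validData can now be written to the data store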
Answer 1:
Have you considered using an Either?
val counts = csvInput
  .map { row =>
    try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      Right(data)
    } catch {
      case e: Throwable => Left(row)
    }
  }
val failedCount = counts.filter(_.isLeft).count()
val successCount = counts.filter(_.isRight).count()
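If counts lives in an RDD (per the workaround above, since Datasets can't encode Either on Scala 2.11), both counts can also be computed in a single pass. A sketch:

val (failedCount, successCount) = counts
  .map(e => if (e.isLeft) (1L, 0L) else (0L, 1L))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))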
Answer 2:
Did you try Spark DDQ? It has most of the data quality rules that you will need, and you can even extend and customize it.
Link: https://github.com/FRosner/drunken-data-quality
Source: https://stackoverflow.com/questions/39755613/how-to-get-count-of-invalid-data-during-parse