Question
We are using Spark to parse a big CSV file, which may contain invalid data. We want to save the valid data into the data store, and also return how many valid records we imported and how many invalid ones we rejected.
I am wondering how we can do this in Spark. What's the standard approach when reading data?
My current approach uses an Accumulator, but it's not accurate due to how accumulators work in Spark: updates made inside a transformation can be applied more than once if a task is retried or a stage is recomputed, so the error count can end up too high.
import java.util.{Date, UUID}
import scala.util.{Success, Try}

// we define case class CSVInputData: all fields are defined as String
val csvInput = spark.read.option("header", "true").csv(csvFile).as[CSVInputData]

val newDS = csvInput
  .flatMap { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(data) => Seq(data)
      case _ =>
        errorAcc.add(1) // may over-count if the task is retried
        Seq()
    }
  }
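An accumulator-free way to get an exact invalid count is to derive it by subtraction from two count actions, which are exact. This is a minimal sketch reusing csvInput and newDS from above; caching avoids re-reading the file:

csvInput.cache()
val totalCount   = csvInput.count()
val validCount   = newDS.count()   // rows that failed parsing became empty Seqs
val invalidCount = totalCount - validCount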
I tried to use Either, but it failed with the exception:
java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with scala.util.Either[xx.CSVInputData,xx.DomainData] found
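One way to stay in the Dataset API is to avoid Either entirely and parse into an encoder-friendly case class. This is only a sketch: ParsedRow is a hypothetical simplification of DomainData. Spark derives encoders for case classes built from primitives and Options, unlike for Either on Scala 2.11:

import scala.util.{Failure, Success, Try}

case class ParsedRow(score: Option[Double], error: Option[String])

val parsed = csvInput.map { row =>
  Try(row.score.trim.toDouble) match {
    case Success(s)  => ParsedRow(Some(s), None)
    case Failure(ex) => ParsedRow(None, Some(ex.getMessage))
  }
}
val invalidCount = parsed.filter(_.error.isDefined).count()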
Update
I think Either doesn't work with the Spark 2.0 Dataset API:
spark.read.option("header", "true").csv("any.csv")
  .map { row =>
    try {
      Right("")
    } catch {
      case e: Throwable => Left("")
    }
  }
If we switch to sc (the RDD API), it works:
sc.parallelize('a' to 'z').map { row =>
  try {
    Right("")
  } catch {
    case e: Throwable => Left("")
  }
}.collect()
In the current latest Scala, 2.11 (http://www.scala-lang.org/api/2.11.x/index.html#scala.util.Either), Either doesn't implement the Serializable trait:
sealed abstract class Either[+A, +B] extends AnyRef
In the upcoming 2.12 (http://www.scala-lang.org/api/2.12.x/scala/util/Either.html), it does:
sealed abstract class Either[+A, +B] extends Product with Serializable
This matches the exception above: the result type of a try/catch returning Right or Left is inferred as the least upper bound of the two branches, which on 2.11 is the refined type Product with Serializable with Either[...], and Spark's reflection can't map that refined type to a Java class.
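A quick REPL check illustrates the inferred type (a sketch; the exact rendering may differ between REPL versions):

// In a Scala 2.11 REPL:
scala> val e = if (true) Right("") else Left("")
e: Product with Serializable with scala.util.Either[String,String] = Right()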
Update 2 with a workaround
More info at Spark ETL: Using Either to handle invalid data
Since the Spark Dataset API doesn't work with Either, the workaround is to call ds.rdd and then use try/Left/Right on the RDD to capture both valid and invalid data.
spark.read.option("header", "true").csv("/Users/yyuan/jyuan/1.csv").rdd
  .map { row =>
    try {
      Right("")
    } catch {
      case e: Throwable => Left("")
    }
  }.collect()
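A fuller sketch of the same workaround applied to the question's domain (assuming the csvInput Dataset, DomainData bean, and Util helper from above). Dropping to the RDD sidesteps encoders entirely, so Either works, and caching lets us count both sides of one parse:

import java.util.{Date, UUID}

val parsed = csvInput.rdd.map { row =>
  try {
    val data = new DomainData()
    data.setScore(row.score.trim.toDouble)
    data.setId(UUID.randomUUID().toString())
    data.setDate(Util.parseDate(row.eventTime.trim))
    data.setUpdateDate(new Date())
    Right(data)
  } catch {
    case e: Throwable => Left(row)
  }
}
parsed.cache() // parse once, count both sides
val invalidCount = parsed.filter(_.isLeft).count()
val validData    = parsed.collect { case Right(d) => d } // still an RDD
val validCount   = validData.count()
// validData can now be written to the data store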
Answer 1:
Have you considered using an Either?
val counts = csvInput
  .map { row =>
    try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      Right(data)
    } catch {
      case e: Throwable => Left(row)
    }
  }
val failedCount = counts.filter(_.isLeft).count()
val successCount = counts.filter(_.isRight).count()
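If counts lives in an RDD (per the workaround above, since Datasets can't encode Either on Scala 2.11), both counts can also be computed in a single pass. A sketch:

val (failedCount, successCount) = counts
  .map(e => if (e.isLeft) (1L, 0L) else (0L, 1L))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))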
Answer 2:
Did you try Spark DDQ? It has most of the data quality rules that you will need, and you can even extend and customize it.
Link: https://github.com/FRosner/drunken-data-quality
Source: https://stackoverflow.com/questions/39755613/how-to-get-count-of-invalid-data-during-parse