I am reading a file into a DataFrame like this:
val df = spark.read
.option("sep", props.inputSeperator)
.option("header", "true")
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
.csv(inputLoc)
The file is set up like this:
col_a|col_b|col_c|col_d
1|first|last|
2|this|is|data
3|ok
4|more||stuff
5|||
Now, Spark will read all of this as acceptable data. However, I want 3|ok
to be marked as a bad record because its size does not match the header size. Is this possible?
You can pre-check the line lengths against the header yourself:
// Count the fields in the header, then write out every line whose field count differs
val a = spark.sparkContext.textFile(pathOfYourFile)
val size = a.first.split("\\|", -1).length
a.filter(i => i.split("\\|", -1).length != size).saveAsTextFile("/mnt/adls/udf_databricks/error")
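If you also want to keep the rows that do pass the check as a DataFrame, a minimal sketch along the same lines (assuming Spark 2.2+, where DataFrameReader.csv accepts a Dataset[String], and hard-coding the pipe separator from the question) could be:
import spark.implicits._

val lines  = spark.sparkContext.textFile(inputLoc)
val header = lines.first
val width  = header.split("\\|", -1).length

// Keep the header plus every row whose field count matches it
val good = lines.filter(l => l == header || l.split("\\|", -1).length == width)

val goodDF = spark.read
  .option("sep", "|")
  .option("header", "true")
  .csv(good.toDS())   // csv(Dataset[String]) is available since Spark 2.2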
The badRecordsPath option below is supported by the Databricks implementation of Spark. I don't see a schema mapping in your code; could you map it and try?
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
Change your code like below:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
StructField("col_a", StringType, true),
StructField("col_b", StringType, true),
StructField("col_c", StringType, true),
StructField("col_d", StringType, true)))
val df = spark.read
.option("sep", props.inputSeperator)
.option("header", "true")
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
.schema(customSchema)
.csv(inputLoc)
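Once the read runs with the schema in place, you can inspect what was rejected. A minimal sketch, assuming the layout the Databricks docs describe (bad records written as JSON under <badRecordsPath>/<timestamp>/bad_records/):
spark.read
  .json("/mnt/adls/udf_databricks/error/*/bad_records")
  .show(false)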
For more details, you can refer to the Databricks documentation on badRecordsPath.
Thanks, Karthick
Source: https://stackoverflow.com/questions/52043653/possible-to-put-records-that-arent-same-length-as-header-records-to-bad-record