I am reading a file into a DataFrame like this:
val df = spark.read
.option("sep", props.inputSeperator)
.option("header", "true")
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
.csv(inputLoc)
The file is set up like this:
col_a|col_b|col_c|col_d
1|first|last|
2|this|is|data
3|ok
4|more||stuff
5|||
Now, Spark will read all of this as acceptable data. However, I want 3|ok
to be marked as a bad record because its size does not match the header size. Is this possible?
You can pre-check the line lengths against the header yourself:
// Count the fields in the header, then write out every line whose field count differs
val a = spark.sparkContext.textFile(pathOfYourFile)
val size = a.first.split("\\|", -1).length
a.filter(i => i.split("\\|", -1).length != size).saveAsTextFile("/mnt/adls/udf_databricks/error")
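If you also want to keep the rows that do pass the check as a DataFrame, a minimal sketch along the same lines (assuming Spark 2.2+, where DataFrameReader.csv accepts a Dataset[String], and hard-coding the pipe separator from the question) could be:
import spark.implicits._

val lines  = spark.sparkContext.textFile(inputLoc)
val header = lines.first
val width  = header.split("\\|", -1).length

// Keep the header plus every row whose field count matches it
val good = lines.filter(l => l == header || l.split("\\|", -1).length == width)

val goodDF = spark.read
  .option("sep", "|")
  .option("header", "true")
  .csv(good.toDS())   // csv(Dataset[String]) is available since Spark 2.2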
The badRecordsPath option below is supported by the Databricks implementation of Spark. I don't see a schema mapping in your code; could you map it and try?
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
Change your code like below:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
StructField("col_a", StringType, true),
StructField("col_b", StringType, true),
StructField("col_c", StringType, true),
StructField("col_d", StringType, true)))
val df = spark.read
.option("sep", props.inputSeperator)
.option("header", "true")
.option("badRecordsPath", "/mnt/adls/udf_databricks/error")
.schema(customSchema)
.csv(inputLoc)
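Once the read runs with the schema in place, you can inspect what was rejected. A minimal sketch, assuming the layout the Databricks docs describe (bad records written as JSON under <badRecordsPath>/<timestamp>/bad_records/):
spark.read
  .json("/mnt/adls/udf_databricks/error/*/bad_records")
  .show(false)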
For more details, you can refer to the Databricks documentation on badRecordsPath.
Thanks, Karthick
Source: https://stackoverflow.com/questions/52043653/possible-to-put-records-that-arent-same-length-as-header-records-to-bad-record