Question
Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record. From the documentation:

permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record

However, when I try the following example, I don't see any column named _corrupt_record; the records that don't match the schema simply appear as null.
data.csv
data
10.00
11.00
$12.00
$13
gaurang
code
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}

val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), false)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
Schema and output
scala> df.printSchema()
root
|-- value: decimal(25,10) (nullable = true)
scala> df.show()
+-------------+
| value|
+-------------+
|10.0000000000|
|11.0000000000|
| null|
| null|
| null|
+-------------+
If I change the mode to FAILFAST, I get an error when I try to view the data.
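For reference, a minimal sketch of the FAILFAST variant (same file and schema as above; the exact exception text varies by Spark version):

// In FAILFAST mode the reader throws on the first malformed row.
// Reads are lazy, so the exception surfaces at the first action, e.g. show().
val failFastDf = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(schema)
  .load("../test.csv")

failFastDf.show()
// org.apache.spark.SparkException: Malformed records are detected in record parsing.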
Answer 1:
Adding the _corrupt_record column to the schema, as suggested by Andrew and Prateek, resolved the issue.
import org.apache.spark.sql.types.{StructField, StructType, StringType, DecimalType}

// _corrupt_record must be declared in the schema explicitly, and the data
// column must be nullable so malformed rows can be set to null.
val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), true),
  new StructField("_corrupt_record", StringType, true)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
Querying the data
scala> df.show()
+-------------+---------------+
| value|_corrupt_record|
+-------------+---------------+
|10.0000000000| null|
|11.0000000000| null|
| null| $12.00|
| null| $13|
| null| gaurang|
+-------------+---------------+
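Once the column exists, the corrupt rows can be split out for inspection. A minimal sketch, assuming the DataFrame from the answer above; note that since Spark 2.3, queries referencing only the internal corrupt-record column against the raw CSV source are disallowed, so the DataFrame is cached first:

// Cache first: filtering/selecting only _corrupt_record directly against
// the raw file raises an AnalysisException in Spark 2.3+.
df.cache()

// Rows that parsed cleanly
val good = df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")

// Rows that failed to match the schema, kept verbatim for inspection
val bad = df.filter(df("_corrupt_record").isNotNull).select("_corrupt_record")
bad.show()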
Source: https://stackoverflow.com/questions/58631365/spark-read-csv-not-showing-corroupt-records