Spark read CSV - Not showing corrupt records

佐手、 posted on 2021-01-27 20:54:30

Question


Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record.

permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record

However, when I try the following example, I don't see any column named _corrupt_record. The records that don't match the schema simply appear as null.

data.csv

data
10.00
11.00
$12.00
$13
gaurang

code

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), false)
))
val df = spark.read.format("csv") 
  .option("header", "true") 
  .option("mode", "PERMISSIVE") 
  .schema(schema) 
  .load("../test.csv")

schema

scala> df.printSchema()
root
 |-- value: decimal(25,10) (nullable = true)


scala> df.show()
+-------------+
|        value|
+-------------+
|10.0000000000|
|11.0000000000|
|         null|
|         null|
|         null|
+-------------+

If I change the mode to FAILFAST, I get an error when I try to display the data.


Answer 1:


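Adding the _corrupt_record column to the schema, as suggested by Andrew and Prateek, resolved the issue. When a schema is supplied explicitly, Spark only populates _corrupt_record if the schema contains a string field with that name.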

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), true),
  new StructField("_corrupt_record", StringType, true)
))
val df = spark.read.format("csv") 
  .option("header", "true") 
  .option("mode", "PERMISSIVE") 
  .schema(schema) 
  .load("../test.csv")

Querying the data

scala> df.show()
+-------------+---------------+
|        value|_corrupt_record|
+-------------+---------------+
|10.0000000000|           null|
|11.0000000000|           null|
|         null|         $12.00|
|         null|            $13|
|         null|        gaurang|
+-------------+---------------+
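
As a follow-up, here is a minimal sketch of how the good and bad rows could be separated once _corrupt_record is in the schema. It assumes the same ../test.csv file as above; the names good and bad are only illustrative. The .cache() call is there because, since Spark 2.3, queries on raw CSV/JSON files that reference only the corrupt record column are disallowed, and caching the parsed result first is the documented workaround.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, StringType, StructField, StructType}

// Reuses the existing session in spark-shell, or creates one otherwise.
val spark = SparkSession.builder().getOrCreate()

// The schema must include a nullable string field for the corrupt record column.
val schema = StructType(Seq(
  StructField("value", DecimalType(25, 10), nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  // "_corrupt_record" is the default name; it is set explicitly here for clarity.
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .load("../test.csv") // assumption: same path as in the question
  .cache()             // cache parsed rows before querying the corrupt column

// Rows that parsed cleanly (illustrative name).
val good = df.filter(col("_corrupt_record").isNull).drop("_corrupt_record")

// Raw text of rows that failed to parse (illustrative name).
val bad = df.filter(col("_corrupt_record").isNotNull).select("_corrupt_record")

good.show()
bad.show()
```

If a different column name is preferred, the same name has to be used both in the columnNameOfCorruptRecord option and in the schema.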


Source: https://stackoverflow.com/questions/58631365/spark-read-csv-not-showing-corroupt-records
