Discard Bad record and load only good records to dataframe from json file in pyspark

Submitted by 不羁的心 on 2021-01-29 21:35:52

Question


The JSON file generated by the API looks like the sample below. The format of the file is not valid JSON. Can we discard the bad records and load only the good rows into a DataFrame using PySpark?

{
"name": "PowerAmplifier",
"Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
"url": "https://www.onsemi.com/products/amplifiers-comparators/", 
"image": "https://www.onsemi.com/products/amplifiers-comparators/", 
"ThresholdTime": "48min", 
"MFRDate": "2019-05-08", 
"FallTime": "15Min", 
"description": "PowerAmplifier"
}

Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Fairchild') \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = spark.read.json("D:\\Fairchild\\component.json").select("name", "url", "image", "ThresholdTime", "MFRDate", "FallTime", "description")
#df.show()
df.registerTempTable("data")
spark.sql("select * from data").show()

Error:

Traceback (most recent call last):
  File "C:/Users/RAJEE/PycharmProjects/onsemi/test.py", line 24, in <module>
    result = spark.sql("select * from data")
  File "C:\spark\python\pyspark\sql\session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "C:\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException:

Please share your suggestions.


Answer 1:


Since Spark 2.3, queries against raw JSON/CSV files are disallowed when the referenced columns include only the internal corrupt-record column. I added two JSON records, one good and one bad, and then I could query the DataFrame: by default it needs at least one good JSON record. You can also drop all malformed records by using option("mode", "DROPMALFORMED").

You can use option("mode", "DROPMALFORMED") in your Python code too:

spark.read.option("mode", "DROPMALFORMED").json("D:\\Fairchild\\component.json").select("name", "url", "image", "ThresholdTime", "MFRDate", "FallTime", "description")

JSON data -

{"name": "PowerAmplifier","Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}
{"name": "PowerAmplifier","Component": "good record","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}

Code (Scala) to read it without dropping malformed records. In this case you can filter out bad records using the "_corrupt_record" column.

    val j = spark.read.json("/Users/msayed2/Documents/temp/test.txt")
    j.show()

Result

+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
|  Component|FallTime|   MFRDate|ThresholdTime|     _corrupt_record|   description|               image|          name|                 url|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
|       null|    null|      null|         null|{"name": "PowerAm...|          null|                null|          null|                null|
|good record|   15Min|2019-05-08|        48min|                null|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+

You can ignore malformed records entirely like this -

    val j = spark.read.option("mode", "DROPMALFORMED").json("/Users/msayed2/Documents/temp/test.txt")
    j.show()

The result doesn't contain the malformed record:

+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
|  Component|FallTime|   MFRDate|ThresholdTime|   description|               image|          name|                 url|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
|good record|   15Min|2019-05-08|        48min|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+


Source: https://stackoverflow.com/questions/59774527/discard-bad-record-and-load-only-good-records-to-dataframe-from-json-file-in-pys
