Spark read JSON: how to read a field that alternates between integer and struct

Question

I am trying to read multiple JSON files into a DataFrame. Both files have a "Value" node, but the type of this node alternates between integer and struct:

File 1:

{
   "Value": 123
}

File 2:

{
   "Value": {
      "Value": "On",
      "ValueType": "State",
      "IsSystemValue": true
   }
}

My goal is to read the files into a dataframe like this:

|---------------------|------------------|---------------------|------------------|
|         File        |       Value      |      ValueType      |   IsSystemValue  |
|---------------------|------------------|---------------------|------------------|
|      File1.json     |        123       |        null         |       null       |
|---------------------|------------------|---------------------|------------------|
|      File2.json     |        On        |        State        |       true       |
|---------------------|------------------|---------------------|------------------|

It is possible that all of the files read are like File 1 and none like File 2, or vice versa, or a combination of both. This is not known ahead of time. Any ideas?

Answer 1:

See whether this helps.

Load the test data

    /**
      * test/File1.json
      * -----
      * {
      * "Value": 123
      * }
      */
    /**
      * test/File2.json
      * ---------
      * {
      * "Value": {
      * "Value": "On",
      * "ValueType": "State",
      * "IsSystemValue": true
      * }
      * }
      */
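    // Directory on the classpath that holds File1.json and File2.json
    val path = getClass.getResource("/test").getPath
    val df = spark.read
      .option("multiLine", true) // each file is one JSON document spread over several lines
      .json(path)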

    df.show(false)
    df.printSchema()

    /**
      * +-------------------------------------------------------+
      * |Value                                                  |
      * +-------------------------------------------------------+
      * |{"Value":"On","ValueType":"State","IsSystemValue":true}|
      * |123                                                    |
      * +-------------------------------------------------------+
      *
      * root
      * |-- Value: string (nullable = true)
      */
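
The string type above is the result of schema inference over two conflicting types. If every file looked like File 1, inference would report a long, and if every file looked like File 2 it would report a struct, so the later string functions would no longer apply. One way to guard against that is to skip inference and declare "Value" as a string up front. The following is a minimal sketch under that assumption; the names `ValueAsString`, `valueSchema`, and `readValues`, and the temporary directory, are hypothetical and not part of the original answer.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ValueAsString {
  // Hypothetical schema: declare "Value" as a string so nothing is inferred.
  val valueSchema: StructType =
    StructType(Seq(StructField("Value", StringType, nullable = true)))

  // Reads every JSON file under `dir`, keeping "Value" as raw text.
  def readValues(spark: SparkSession, dir: String): DataFrame =
    spark.read
      .schema(valueSchema)        // user-supplied schema replaces inference
      .option("multiLine", true)  // each file is one multi-line JSON document
      .json(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("value-as-string")
      .getOrCreate()

    // Hypothetical input: a temp directory that only contains integer-style files.
    val dir = Files.createTempDirectory("values")
    Files.write(dir.resolve("File1.json"), """{ "Value": 123 }""".getBytes(UTF_8))

    // The schema is the declared one, so "Value" stays a string column.
    readValues(spark, dir.toString).printSchema()

    spark.stop()
  }
}
```

With the column pinned to a string, the same `get_json_object` calls can be used whether the input is all integers, all structs, or a mix.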

Transform the JSON string

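    import org.apache.spark.sql.functions.{coalesce, col, get_json_object, input_file_name, substring_index}

    // Keep only the file name from the full input path
    df.withColumn("File", substring_index(input_file_name(), "/", -1))
      .withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
      .withColumn("IsSystemValue", get_json_object(col("Value"), "$.IsSystemValue"))
      // Use the nested Value when present, otherwise fall back to the raw string
      .withColumn("Value", coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))
      .show(false)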

    /**
      * +-----+----------+---------+-------------+
      * |Value|File      |ValueType|IsSystemValue|
      * +-----+----------+---------+-------------+
      * |On   |File2.json|State    |true         |
      * |123  |File1.json|null     |null         |
      * +-----+----------+---------+-------------+
      */
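
Note that `get_json_object` returns strings, so "IsSystemValue" in the output above is the text `true` rather than a boolean. If typed columns are wanted, a cast can be added. The sketch below assumes a string column named "Value" as in the answer; the object name `TypedValueColumns`, the helper `flatten`, and the in-memory sample rows are hypothetical and only there to keep the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, get_json_object}

object TypedValueColumns {
  // Expects a string column "Value" holding either a bare scalar or a JSON object.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
      // get_json_object yields a string; cast it to get a real boolean column
      .withColumn("IsSystemValue",
        get_json_object(col("Value"), "$.IsSystemValue").cast("boolean"))
      // Replace "Value" last, because the columns above read the original text
      .withColumn("Value",
        coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("typed-value-columns")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows mirroring File 1 and File 2 after loading.
    val df = Seq(
      "123",
      """{"Value":"On","ValueType":"State","IsSystemValue":true}"""
    ).toDF("Value")

    val flat = flatten(df)
    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```

"Value" is left as a string here because it holds both numbers and state names; casting it to a numeric type would turn values such as `On` into null.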

Source: https://stackoverflow.com/questions/62228733/spark-read-json-how-to-read-field-that-alternates-between-integer-and-struct

Tags

apache-spark

pyspark

databricks