Question
I am trying to read multiple JSON files into a DataFrame. Both files have a "Value" node, but the type of this node alternates between integer and struct:
File 1:
{
"Value": 123
}
File 2:
{
"Value": {
"Value": "On",
"ValueType": "State",
"IsSystemValue": true
}
}
My goal is to read the files into a dataframe like this:
|---------------------|------------------|---------------------|------------------|
| File | Value | ValueType | IsSystemValue |
|---------------------|------------------|---------------------|------------------|
| File1.json | 123 | null | null |
|---------------------|------------------|---------------------|------------------|
| File2.json | On | State | true |
|---------------------|------------------|---------------------|------------------|
It is possible that all of the files are like File 1 and none like File 2, or vice versa, or a combination of both. It's not known ahead of time. Any ideas?
Answer 1:
See if this helps.
Load the test data
/**
* test/File1.json
* -----
* {
* "Value": 123
* }
*/
/**
* test/File2.json
* ---------
* {
* "Value": {
* "Value": "On",
* "ValueType": "State",
* "IsSystemValue": true
* }
* }
*/
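// Resolve the directory that holds File1.json and File2.json (here a classpath resource)
val path = getClass.getResource("/test").getPath
// multiLine is needed because each JSON document spans several lines
val df = spark.read
.option("multiLine", true)
.json(path)
df.show(false)
df.printSchema()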
/**
* +-------------------------------------------------------+
* |Value |
* +-------------------------------------------------------+
* |{"Value":"On","ValueType":"State","IsSystemValue":true}|
* |123 |
* +-------------------------------------------------------+
*
* root
* |-- Value: string (nullable = true)
*/
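The column comes out as string here only because the two files disagree on the type of "Value", so schema inference falls back to string. If every file had the same shape, inference would presumably produce a long or a struct instead, and the string-based step that follows would not apply to a struct. A minimal sketch, assuming it is acceptable to pin the schema by hand so the result does not depend on which files happen to be present (`ExplicitSchemaSketch`, `valueAsString` and `readValueAsString` are hypothetical names, not part of the original answer):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Declare Value as a plain string so schema inference is skipped entirely.
  val valueAsString: StructType =
    StructType(Seq(StructField("Value", StringType, nullable = true)))

  // Read every JSON file under `path` with the fixed schema.
  // A nested object in Value should then arrive as its JSON text,
  // and a bare number as its string form.
  def readValueAsString(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(valueAsString)
      .option("multiLine", true)
      .json(path)
}
```

With this, `ExplicitSchemaSketch.readValueAsString(spark, path)` could stand in for the `spark.read ... .json(path)` call above; the rest of the answer stays the same.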
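Transform the JSON string
import org.apache.spark.sql.functions._
// File: keep only the last path segment of the source file name
df.withColumn("File", substring_index(input_file_name(),"/", -1))
// get_json_object returns null when Value is not a JSON object with that key
.withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
.withColumn("IsSystemValue", get_json_object(col("Value"), "$.IsSystemValue"))
// Overwrite Value last: use the nested "Value" if present, else the raw value
.withColumn("Value", coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))
.show(false)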
/**
* +-----+----------+---------+-------------+
* |Value|File |ValueType|IsSystemValue|
* +-----+----------+---------+-------------+
* |On |File2.json|State |true |
* |123 |File1.json|null |null |
* +-----+----------+---------+-------------+
*/
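The result above has every column as a string and puts "Value" first, whereas the question asks for File, Value, ValueType, IsSystemValue. A minimal sketch of one way to get that layout, assuming the input DataFrame has a single string column named "Value" as shown in the schema above (`FlattenValueSketch` and `flattenValue` are hypothetical names; the cast of IsSystemValue to boolean is an added assumption about the desired type):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, get_json_object, input_file_name, substring_index}

object FlattenValueSketch {
  // A single select evaluates every expression against the original Value
  // column, so the order in which the columns are listed does not matter.
  def flattenValue(df: DataFrame): DataFrame =
    df.select(
      // Last path segment of the file each row was read from
      substring_index(input_file_name(), "/", -1).as("File"),
      // Nested "Value" when the row holds a JSON object, otherwise the raw value
      coalesce(get_json_object(col("Value"), "$.Value"), col("Value")).as("Value"),
      get_json_object(col("Value"), "$.ValueType").as("ValueType"),
      // get_json_object returns a string; cast it to a real boolean
      get_json_object(col("Value"), "$.IsSystemValue").cast("boolean").as("IsSystemValue")
    )
}
```

Calling `FlattenValueSketch.flattenValue(df).show(false)` on the DataFrame loaded earlier should give the same values as above in the requested column order. "Value" stays a string because it has to hold both "123" and "On".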
Source: https://stackoverflow.com/questions/62228733/spark-read-json-how-to-read-field-that-alternates-between-integer-and-struct