Question
I am trying to create a DataFrame from an arbitrary JSON string. The JSON string is generally very deep and sometimes nested. The JSON string is like:
val json_string = """{
"Total Value": 3,
"Topic": "Example",
"values": [
{
"value1": "#example1",
"points": [
[
"123",
"156"
]
],
"properties": {
"date": "12-04-19",
"model": "Model example 1"
}
},
{"value2": "#example2",
"points": [
[
"124",
"157"
]
],
"properties": {
"date": "12-05-19",
"model": "Model example 2"
}
}
]
}"""
The output which I am expecting is:
+-----------+---------+----------+------------------+------------------+------------------------+--------------------------+
|Total Value| Topic   | values   | values.points[0] | values.points[1] | values.properties.date | values.properties.model  |
+-----------+---------+----------+------------------+------------------+------------------------+--------------------------+
| 3         | Example | example1 | 123              | 156              | 12-04-19               | Model example 1          |
| 3         | Example | example2 | 124              | 157              | 12-05-19               | Model example 2          |
+-----------+---------+----------+------------------+------------------+------------------------+--------------------------+
Currently I flatten by picking some key in the JSON to get the schema and then flattening, but I don't want to flatten this way. The flattening should be independent of any given key and produce the output shown above. Even after giving the key (values in this case), I still get 2 rows for the same record because points is an array; I want points[0] and points[1] to go into separate columns instead. My Scala Spark code is:
val key = "values" //Ideally this should not be given in my case.
val jsonFullDFSchemaString = spark.read.json(json_location).select(col(key)).schema.json; // changing values to reportData
val jsonFullDFSchemaStructType = DataType.fromJson(jsonFullDFSchemaString).asInstanceOf[StructType]
val df = spark.read.schema(jsonFullDFSchemaStructType).json(json_location);
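If the goal is to avoid hard-coding a key, the schema step can be skipped and the file read as-is; a minimal sketch, assuming json_location points at the JSON shown above (dfNoKey is just an illustrative name):
// read the whole document and let Spark infer the full schema, no key needed
val dfNoKey = spark.read.json(json_location)
dfNoKey.printSchema()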
Now for flattening I am using:
def flattenDataframe(df: DataFrame): DataFrame = {
  //getting all the fields from schema
  val fields = df.schema.fields
  val fieldNames = fields.map(x => x.name)
  //length shows the number of fields inside dataframe
  val length = fields.length
  for (i <- 0 to fields.length - 1) {
    val field = fields(i)
    val fieldtype = field.dataType
    val fieldName = field.name
    fieldtype match {
      case arrayType: ArrayType =>
        val fieldName1 = fieldName
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName1)
        val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName1) as $fieldName1")
        //val fieldNamesToSelect = (fieldNamesExcludingArray ++ Array(s"$fieldName1.*"))
        val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
        return flattenDataframe(explodedDf)
      case structType: StructType =>
        val childFieldnames = structType.fieldNames.map(childname => fieldName + "." + childname)
        val newfieldNames = fieldNames.filter(_ != fieldName) ++ childFieldnames
        val renamedcols = newfieldNames.map(x => col(x.toString()).as(x.toString().replace(".", "_").replace("$", "_").replace("__", "_").replace(" ", "").replace("-", "")))
        val explodedf = df.select(renamedcols: _*)
        return flattenDataframe(explodedf)
      case _ =>
    }
  }
  df
}
Now finally getting the flattened DataFrame from the JSON:
val tableSchemaDF = flattenDataframe(df)
println(tableSchemaDF)
So ideally any JSON file should get flattened as shown above, without giving any root key and without creating 2 rows. Hope I have given enough details. Any help will be appreciated. Thanks.
Please note: the JSON data is coming from an API, so it's not certain whether the root key 'values' will be there. That's why I am not going with giving a key for flattening.
Answer 1:
Here's a solution based on recursion, just a bit "hacky" at the end since you have some specific requirements:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

@tailrec
def recurs(df: DataFrame): DataFrame = {
  // stop when no array-of-struct or struct column remains
  if (df.schema.fields.find(_.dataType match {
    case ArrayType(StructType(_), _) | StructType(_) => true
    case _ => false
  }).isEmpty) df
  else {
    val columns = df.schema.fields.map(f => f.dataType match {
      case _: ArrayType => explode(col(f.name)).as(f.name)
      case s: StructType => col(s"${f.name}.*")
      case _ => col(f.name)
    })
    recurs(df.select(columns: _*))
  }
}
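Here df is the DataFrame parsed from the full JSON document, not the key-restricted one built in the question; a minimal sketch of how it could be created from the posted json_string, assuming a SparkSession named spark:
import spark.implicits._ // for .toDS and the $"..." column syntax below
// parse the inline JSON string from the question into a DataFrame
val df = spark.read.json(Seq(json_string).toDS)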
val recursedDF = recurs(df)
// value1/value2 end up in separate columns after flattening, so coalesce them into one "values" column
val valuesColumns = recursedDF.columns.filter(_.startsWith("value"))
val projectionDF = recursedDF.withColumn("values", coalesce(valuesColumns.map(col): _*))
  .withColumn("point[0]", $"points".getItem(0))
  .withColumn("point[1]", $"points".getItem(1))
  .drop(valuesColumns :+ "points": _*)
projectionDF.show(false)
Output:
+-------+-----------+--------+---------------+---------+--------+--------+
|Topic |Total Value|date |model |values |point[0]|point[1]|
+-------+-----------+--------+---------------+---------+--------+--------+
|Example|3 |12-04-19|Model example 1|#example1|123 |156 |
|Example|3 |12-05-19|Model example 2|#example2|124 |157 |
+-------+-----------+--------+---------------+---------+--------+--------+
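If the hard-coded point[0]/point[1] projection above feels too specific, a more generic variant could expand every remaining array column into indexed columns. This is only a sketch, not part of the original answer; expandArrays and maxLen are illustrative names, and it assumes the maximum array length is known up front:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ArrayType

// hypothetical helper: replace each array column c with columns c[0] ... c[maxLen-1]
def expandArrays(df: DataFrame, maxLen: Int): DataFrame =
  df.schema.fields.filter(_.dataType.isInstanceOf[ArrayType]).foldLeft(df) { (acc, f) =>
    val withItems = (0 until maxLen).foldLeft(acc) { (d, i) =>
      d.withColumn(s"${f.name}[$i]", col(f.name).getItem(i))
    }
    withItems.drop(f.name)
  }

// e.g. expandArrays(recursedDF, 2) would replace "points" with "points[0]" and "points[1]"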
Answer 2:
I would rather suggest going with Spark's built-in functions. You can take advantage of Spark's explode function to achieve this.
Here is the code snippet.
scala> val df = spark.read.json(Seq(json_string).toDS)
scala> var dfd = df.select($"topic",$"total value",explode($"values").as("values"))
Here I am choosing the columns based on your needs. If a column you need is not selected, please add it as per your requirement.
scala> dfd.select($"topic",$"total value",$"values.points".getItem(0)(0).as("point_0"),$"values.points".getItem(0)(1).as("point_1"),$"values.properties.date".as("_date"),$"values.properties.model".as("_model")).show
+-------+-----------+-------+-------+--------+---------------+
| topic|total value|point_0|point_1| _date| _model|
+-------+-----------+-------+-------+--------+---------------+
|Example| 3| 123| 156|12-04-19|Model example 1|
|Example| 3| 124| 157|12-05-19|Model example 2|
+-------+-----------+-------+-------+--------+---------------+
If you have a limited number of columns in JSON, this approach will give you an optimal result.
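Since the question notes that the values key may not always be present in the API response, a small guard could be added before exploding; this is a sketch, not part of the original answer, and assumes the same df as above with org.apache.spark.sql.functions._ and spark.implicits._ in scope:
// explode only when the "values" column actually exists in the parsed JSON
val dfd =
  if (df.columns.contains("values"))
    df.select($"topic", $"total value", explode($"values").as("values"))
  else
    df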
Source: https://stackoverflow.com/questions/59299725/flatten-any-nested-json-string-and-convert-to-dataframe-using-spark-scala