Question
I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below, 'jsonSchema' is a StructType with the correct layout for the json-string, which is in the 'body' column of the dataframe.
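(For concreteness, jsonSchema is built roughly along these lines — the field name below is only a placeholder for the real layout:)

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder layout only; the real jsonSchema matches the actual json-string in `body`
val jsonSchema = StructType(Seq(
  StructField("propertyNameInTheParsedJsonObject", StringType)
))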
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
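Concretely, the end state I'm after looks roughly like this (Body here is a hypothetical case class that mirrors jsonSchema):

import org.apache.spark.sql.Dataset

// Hypothetical case class mirroring the fields declared in jsonSchema
case class Body(propertyNameInTheParsedJsonObject: String)

// Goal: a typed Dataset parsed out of oldDF's body column
val bodyDS: Dataset[Body] = ??? // this is the part I can't get to work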
Answer 1:
from_json returns a struct (or array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct, you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
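Putting that together, a minimal end-to-end sketch (picking up oldDF and jsonSchema from the question; the field name and the Body case class are assumptions, so substitute whatever the real schema declares):

import org.apache.spark.sql.functions.from_json
import spark.implicits._   // assumes the SparkSession named `spark` that Databricks / spark-shell provide

// Give the parsed struct a name so its fields can be reached with dot notation
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

// Plain struct: select the nested field directly
val transform = newDF.select($"parsed.propertyNameInTheParsedJsonObject")

// Flatten the struct into top-level columns and move to a typed Dataset
// (Body is a hypothetical case class mirroring jsonSchema, as in the question)
case class Body(propertyNameInTheParsedJsonObject: String)
val bodyDS = newDF.select($"parsed.*").as[Body]

// If jsonSchema had been array<struct<...>> instead, explode the array first:
// newDF.select(org.apache.spark.sql.functions.explode($"parsed") as "element")
//      .select($"element.propertyNameInTheParsedJsonObject")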
Source: https://stackoverflow.com/questions/52945498/how-do-i-use-a-from-json-dataframe-in-spark