How do I use a from_json() dataframe in Spark?


Question


I'm trying to create a dataset from a JSON string inside a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below, 'jsonSchema' is a StructType matching the layout of the JSON string held in the 'body' column of the dataframe.

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))

This returns a dataframe where the root object is

jsontostructs(CAST(body AS STRING)):struct

followed by the fields in the schema (looks correct). When I try another select on the newDF

val transform = newDF.select($"propertyNameInTheParsedJsonObject")

it throws the exception

org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given 
input columns: [jsontostructs(CAST(body AS STRING))];;

I'm apparently missing something. I expected from_json to return a dataframe I could manipulate further.

My ultimate objective is to cast the JSON string in the oldDF 'body' column to a dataset.


Answer 1:


from_json returns a struct (or array<struct<...>>) column, i.e. a nested object. If you give it a meaningful name:

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

and the schema describes a plain struct, you can access its fields with standard methods like

newDF.select($"parsed.propertyNameInTheParsedJsonObject")

otherwise please follow the instructions for accessing arrays.
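To tie this back to the question's end goal, here is a minimal, untested sketch. It assumes a hypothetical case class Body whose fields match jsonSchema and a SparkSession named spark in scope (as there is in a Databricks notebook); adjust names and types to your actual schema.

import org.apache.spark.sql.functions.{from_json, explode}

// Hypothetical case class mirroring jsonSchema; replace with the real fields.
case class Body(id: Long, name: String)

// Parse the JSON string and give the struct column a usable alias.
val parsed = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

// Expand the struct so each schema field becomes a top-level column...
val flattened = parsed.select($"parsed.*")

// ...and, if a typed Dataset is the goal, convert via an Encoder.
import spark.implicits._
val ds = flattened.as[Body]

// If jsonSchema is array<struct<...>>, explode the array first, then expand.
// ("element" is just an arbitrary alias for each array entry.)
val fromArray = parsed
  .select(explode($"parsed") as "element")
  .select($"element.*")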



Source: https://stackoverflow.com/questions/52945498/how-do-i-use-a-from-json-dataframe-in-spark
