Question
I have a JSON object with an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL.
Here is a sample object:
{
stuff: [
{a:1,b:2,c:3}
]
}
So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c.
And in my Spark SQL query, if that array weren't there, I'd be able to use dot notation:
SELECT stuff.c FROM blah
but I can't, because the innermost object is wrapped in an array.
I've tried:
SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL
So, what is the magical way to select that data? Or is that even supported yet?
Answer 1:
It is not clear what you mean by "JSON object", so let's consider two different cases:
An array of structs
import tempfile

path = tempfile.mktemp()
with open(path, "w") as fw:
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')

df = sqlContext.read.json(path)
df.registerTempTable("df")
df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- a: long (nullable = true)
##  |    |    |-- b: long (nullable = true)
##  |    |    |-- c: long (nullable = true)

sqlContext.sql("SELECT stuff[0].a FROM df").show()
## +---+
## |_c0|
## +---+
## |  1|
## +---+
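For what it's worth, the same lookup can be expressed through the DataFrame API instead of raw SQL; a minimal sketch, assuming the struct-typed df built above:

from pyspark.sql.functions import col

# Index into the array (0-based), then into the struct field
df.select(col("stuff")[0]["a"]).show()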
An array of maps
# Note: schema inference from dictionaries has been deprecated
# don't use this in practice
df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
df.registerTempTable("df")
df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: map (containsNull = true)
##  |    |    |-- key: string
##  |    |    |-- value: long (valueContainsNull = true)

sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
## +---+
## |_c0|
## +---+
## |  1|
## +---+
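If you need the field from every element rather than the nth one, exploding the array is the usual approach; a minimal sketch that should work with either schema above (the bracket lookup covers both struct fields and map keys):

from pyspark.sql.functions import col, explode

# One row per array element, then pull the value out of each element
df.select(explode(col("stuff")).alias("s")).select(col("s")["a"]).show()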
See also Querying Spark SQL DataFrame with complex types
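As a side note, newer Spark releases (2.4+) also provide an element_at SQL function that performs the same lookup with 1-based indexing; a hedged sketch, assuming a SparkSession named spark on a 2.4+ runtime rather than the older sqlContext used above:

# element_at is 1-based, unlike the [] ordinal above
spark.sql("SELECT element_at(stuff, 1).a FROM df").show()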
Source: https://stackoverflow.com/questions/34916038/sparksql-sql-syntax-for-nth-item-in-array