Question
I have a JSON object with an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL.
Here is a sample object:
{
stuff: [
{a:1,b:2,c:3}
]
}
So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c.
And in my Spark SQL query, if that array weren't there, I'd be able to use dot notation:
SELECT stuff.c FROM blah
but I can't, because the innermost object is wrapped in an array.
I've tried:
SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL
So, what is the magical way to select that data? Or is that even supported yet?
Answer 1:
It is not clear what you mean by "JSON object", so let's consider two different cases:
An array of structs
import tempfile

path = tempfile.mktemp()
with open(path, "w") as fw:
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')

df = sqlContext.read.json(path)
df.registerTempTable("df")
df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- a: long (nullable = true)
##  |    |    |-- b: long (nullable = true)
##  |    |    |-- c: long (nullable = true)

sqlContext.sql("SELECT stuff[0].a FROM df").show()
## +---+
## |_c0|
## +---+
## |  1|
## +---+
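For what it's worth, the same lookup can be expressed through the DataFrame API instead of raw SQL; a minimal sketch, assuming the struct-typed df built above:

from pyspark.sql.functions import col

# Index into the array (0-based), then into the struct field
df.select(col("stuff")[0]["a"]).show()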
An array of maps
# Note: schema inference from dictionaries has been deprecated
# don't use this in practice
df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
df.registerTempTable("df")
df.printSchema()
## root
##  |-- stuff: array (nullable = true)
##  |    |-- element: map (containsNull = true)
##  |    |    |-- key: string
##  |    |    |-- value: long (valueContainsNull = true)

sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
## +---+
## |_c0|
## +---+
## |  1|
## +---+
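If you need the field from every element rather than the nth one, exploding the array is the usual approach; a minimal sketch that should work with either schema above (the bracket lookup covers both struct fields and map keys):

from pyspark.sql.functions import col, explode

# One row per array element, then pull the value out of each element
df.select(explode(col("stuff")).alias("s")).select(col("s")["a"]).show()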
See also Querying Spark SQL DataFrame with complex types
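As a side note, newer Spark releases (2.4+) also provide an element_at SQL function that performs the same lookup with 1-based indexing; a hedged sketch, assuming a SparkSession named spark on a 2.4+ runtime rather than the older sqlContext used above:

# element_at is 1-based, unlike the [] ordinal above
spark.sql("SELECT element_at(stuff, 1).a FROM df").show()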
Source: https://stackoverflow.com/questions/34916038/sparksql-sql-syntax-for-nth-item-in-array