问题
I am trying different ways to query a record within a array of records and display complete Row as output.
I dont know which nested Object has String "pg". But i want to query on particular object. Whether the object has "pg" or not. If "pg" exist then i want to display that complete row. How to write "spark sql query" on nested objects without specfying the object index.So i dont want to use the index of children.name
My Avro Record:
{
"name": "Parent",
"type":"record",
"fields":[
{"name": "firstname", "type": "string"},
{
"name":"children",
"type":{
"type": "array",
"items":{
"name":"child",
"type":"record",
"fields":[
{"name":"name", "type":"string"}
]
}
}
}
]
}
I am using Spark SQL context to query dataframe which is read. So if input is
Row no Firstname Children.name
1 John Max
Pg
2 Bru huna
aman
Output should return poq 1 since it has row where one object of children.name is pg.
val results = sqlc.sql("SELECT firstname, children.name FROM nestedread where children.name = 'pg'")
results.foreach(x=> println(x(0), x(1).toString))
The above query doesn't work. but it works when i query children[1].name.
I ALSO want to know that if i can filter a set of records and then explode. Instead of first explode and create large number of rows and then filter.
回答1:
It seems that you can use
org.apache.spark.sql.functions.explode(e: Column): Column
for example in my project(in java), i have nested json like this:
{
"error": [],
"trajet": [
{
"something": "value"
}
],
"infos": [
{
"something": "value"
}
],
"timeseries": [
{
"something_0": "value_0",
"something_1": "value_1",
...
"something_n": "value_n"
}
]
}
and i wanted to analyse datas in "timeseries", so i did:
DataFrame ts = jsonDF.select(org.apache.spark.sql.functions.explode(jsonDF.col("timeseries")).as("t"))
.select("t.something_0",
"t.something_1",
...
"t.something_n");
I'm new to spark too. Hope this could give you a hint.
回答2:
The problem was solved by
I found a way through Explode.
val results = sqlc.sql("SELECT firstname, child.name, FROM parent LATERAL VIEW explode(children) childTable AS child
来源:https://stackoverflow.com/questions/37270703/query-a-nested-array-in-parquet-records