I would like to know the best practice for reading a newline-delimited JSON file into a DataFrame. Critically, one of the (required) fields in each record maps to an
I recommend looking into Rumble to query heterogeneous JSON datasets on Spark when they do not fit neatly into DataFrames. This is precisely the problem it solves, and it is free and open-source.
For example:
for $i in json-file("s3://bucket/path/to/newline_separated_json.txt")
where keys($i.data) = "key2"  (: keep only those objects that have a key2 :)
group by $type := $i.type
return {
  "type" : $type,
  "key2-values" : [ $i.data.key2 ]
}
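If you save the query to a file (the name my-query.jq below is only for illustration), you can run it on your cluster by passing the Rumble jar to spark-submit. Treat this as a sketch: the jar name and the exact flag names depend on the Rumble release you download, so check the documentation for your version:
spark-submit spark-rumble.jar --query-path my-query.jq --output-path s3://bucket/path/to/output
Rumble also offers an interactive shell mode, which is handy for trying the query on a small sample before running it over the full dataset.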
(Disclaimer: I am part of the team.)