Dealing with non-uniform JSON columns in a Spark DataFrame

闹比i 2021-01-14 07:48

I would like to know the best practice for reading a newline-delimited JSON file into a DataFrame. Critically, one of the (required) fields in each record maps to an

2 Answers
  •  猫巷女王i
    2021-01-14 08:38

    I recommend looking into Rumble for querying, on Spark, heterogeneous JSON datasets that do not fit into DataFrames. This is precisely the problem it solves, and it is free and open source.

    For example:

    for $i in json-file("s3://bucket/path/to/newline_separated_json.txt")
    where keys($i.data) = "key2" (: keeping only those objects that have a key2 :)
    group by $type := $i.type
    return {
      "type" : $type,
      "key2-values" : [ $i.data.key2 ]
    }
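
    For comparison, the same filter-and-group logic can be sketched in plain Python over newline-delimited records before (or instead of) involving Spark. The sample records below are hypothetical, invented to mirror the shapes the JSONiq query handles; in practice each line would come from the file on S3:

    ```python
    import json

    # Hypothetical newline-delimited JSON records with a non-uniform
    # "data" field, mirroring the shape the JSONiq query assumes.
    lines = [
        '{"type": "a", "data": {"key1": 1}}',
        '{"type": "a", "data": {"key2": 2}}',
        '{"type": "b", "data": {"key1": 3, "key2": 4}}',
    ]

    # Keep only records whose "data" object contains "key2", then
    # collect the key2 values grouped by "type" -- the same logic as
    # the JSONiq query above, expressed in plain Python.
    grouped = {}
    for line in lines:
        rec = json.loads(line)
        if "key2" in rec["data"]:
            grouped.setdefault(rec["type"], []).append(rec["data"]["key2"])

    result = [{"type": t, "key2-values": v} for t, v in grouped.items()]
    print(result)
    # -> [{'type': 'a', 'key2-values': [2]}, {'type': 'b', 'key2-values': [4]}]
    ```

    This works for data that fits on one machine; the point of Rumble is that the same declarative query runs distributed on Spark when it does not.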
    

    (Disclaimer: I am part of the team.)
