Aggregating a JSON array into Map&lt;key, List&gt; in Spark 2.x

丶灬走出姿态 submitted on 2019-12-11 06:24:52

Question


I am quite new to Spark. I have an input JSON file which I am reading as

val df = spark.read.json("/Users/user/Desktop/resource.json");

The contents of resource.json look like this:

{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}

Is there any way we can process this DataFrame and aggregate the result as

Map<key, List<data>>

where data is each JSON object in which that key is present.

For example, the expected result is

Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>

Any references/documents/links to help me proceed further would be a great help.

Thank you.


Answer 1


Here is what you can do:

import org.json4s._
import org.json4s.jackson.Serialization.read

// Case class matching the fields of each JSON record
case class cC(path: String, key: String, region: String)

val df = spark.read.json("/Users/user/Desktop/resource.json")

scala> df.show
+----+-------+-------+
| key|   path| region|
+----+-------+-------+
|key1|  path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
// Note that the original JSON structure is gone: Spark has parsed it into columns.
// Use .toJSON to get the JSON strings back, extract the key from each one,
// and build an RDD[(String, String)], i.e. RDD[(key, json)].

val rdd = df.toJSON.rdd.map(m => {
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
})

scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
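
If the grouped data is too large to collect to the driver, the same grouping can be done on the executors instead. A minimal sketch, assuming the `rdd` built above (the names `groupedRdd` and `localMap` are hypothetical):

// Group on the executors; yields RDD[(String, List[String])] keyed by the JSON "key" field
val groupedRdd = rdd.groupByKey().mapValues(_.toList)

// Only collect into a local Map if the result comfortably fits on the driver
val localMap: Map[String, List[String]] = groupedRdd.collect().toMap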



Answer 2


You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.

Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:

import org.apache.spark.sql.functions._
import spark.implicits._

df.groupBy($"key")
  .agg(collect_list(struct($"path", $"key", $"region")) as "value")

The result would be:

+----+--------------------------------------------------+
|key |value                                             |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]]                         |
+----+--------------------------------------------------+
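
If you need the result back as an in-driver Map[String, List[String]] of the original JSON strings, as asked in the question, one option is to re-serialize each struct with to_json before collecting. A sketch under the assumption that the aggregated result fits on the driver (the names `grouped` and `result` are hypothetical):

import org.apache.spark.sql.functions._
import spark.implicits._

// Re-serialize each record to a JSON string, then collect all strings per key
val grouped = df.groupBy($"key")
  .agg(collect_list(to_json(struct($"path", $"key", $"region"))) as "value")

// Bring the small, aggregated result back to the driver as a plain Scala Map
val result: Map[String, List[String]] = grouped
  .collect()
  .map(row => row.getString(0) -> row.getSeq[String](1).toList)
  .toMap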


Source: https://stackoverflow.com/questions/50915562/aggregating-jsonarray-into-mapkey-list-in-spark-in-spark2-x
