Extract value from structure within an array of arrays in spark using scala

时光怂恿深爱的人放手 提交于 2019-12-10 22:49:58

问题


I am reading json data into spark data frame using scala. The schema is as follows:

 root
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- playerId: string (nullable = true)
 |    |    |-- sources: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- matchId: long (nullable = true)

The data looks as follows:

{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 1 } ] }, { "playerId": "1235", "sources": [ { "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 2 } ] }, { "playerId": "1248", "sources": [ { "score": 12.2 , "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 3 } ] }, { "playerId": "1248", "sources": [ { "matchId": 3 } ] } ] }

The goal is to find if playerId is 1234 and matchId is 1, then return isPlayed as true. The structure of sources is not fixed. There might be fields other than matchId.

I wrote a udf considering metadata to be of type WrappedArray[String] and am able to read the playerId column

def hasPlayer = udf((metadata: WrappedArray[String], playerId: String) => { 
  metadata.contains(playerId)
  })

df.withColumn("hasPlayer", hasPlayer(col("metadata"), col("superPlayerId")))

But I am not able to figure out how to query the matchId field given playerId. I tried reading the field as WrappedArray[WrappedArray[Long]] but it gives a typecasting exception in withColumn on metadata.sources.matchId column.

I am relatively new to Spark. Any help would be deeply appreciated.

Cheers!


回答1:


When you are dealing with JSON, get to know the built-in function explode, which turns a single cell containing a WrappedArray into multiple rows representing the structs inside. I think it helps here (twice):

df.select(explode($"metadata").as("metadata"))
  .select($"metadata.playerId", explode($"metadata.sources.matchId").as("matchId"))
  .filter($"matchId".equalTo(1))
  .select($"matchId", lit(true).as("isPlayed"))

Basically I use explode to create multiple rows (and rename to something convenient), navigate the object tree to the JSON fields I want, repeat the explode/rename process for the matchId, and filter out everything that isn't 1 This allows me finally to use the lit function to "hardcode" values of true for a brand new column called isPlayed since everything that isn't 1 is gone.

If this isn't exactly what you're looking for, hopefully it gives you some pointers. The functions library can be very helpful to you as you get to know Spark.



来源:https://stackoverflow.com/questions/43424554/extract-value-from-structure-within-an-array-of-arrays-in-spark-using-scala

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!