Filtering out nested array entries in a DataFrame (JSON) by passing in a list of values to match against

只谈情不闲聊 submitted on 2019-12-25 19:02:36

Question


I read a huge file into a DataFrame. Each line of the file holds a JSON object like the following:

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

I would like to pass a list of module ids to a method and remove every entry of modules whose id does not equal any of the values in that list, so that only the modules named in the list remain.

Would you have a solution for it?


Answer 1:


The related question "Deleting nested array entries in a DataFrame (JSON) on a condition" already covers how to read your JSON file and manipulate the modules column, which has the following schema:

root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)

This shows that modules is an array of structs, each holding a single string field id. For the current requirement, you first have to convert that array of structs into an array of strings (Array[String]):

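import org.apache.spark.sql.functions.{collect_list, explode}
import spark.implicits._ // enables $"..."; assumes a SparkSession named spark
val finaldf = df.withColumn("modules", explode($"modules.id"))
                  .groupBy("userId", "vars").agg(collect_list("modules").as("modules"))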

The next step is to define a udf function:

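import scala.collection.mutable
import org.apache.spark.sql.functions.udf

def contains = udf((list: mutable.WrappedArray[String]) => {
  val validModules = ??? // your array definition here, for example: Array("Default", "BestValue")
  // keep only the ids that appear in validModules
  list.filter(validModules.contains(_))
})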

Then call the udf function:

finaldf.withColumn("modules", contains($"modules")).show(false)
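
Since the question asks for a method that receives the list of ids, one possible variation is to build the udf from a parameter instead of defining the list inside it. The following is only a sketch under assumptions: the names ModuleFilter, keepValid, keepOnly and filterModules are hypothetical, the modules column is assumed to already be an array of strings (as in finaldf), and the local SparkSession in main exists only to make the example runnable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object ModuleFilter {
  // Pure filtering logic: keep only the ids present in `valid`, preserving order.
  def keepValid(ids: Seq[String], valid: Set[String]): Seq[String] =
    ids.filter(valid.contains)

  // Hypothetical helper: builds a udf that closes over the caller's list of ids.
  def keepOnly(validModules: Seq[String]): UserDefinedFunction = {
    val valid = validModules.toSet
    // A null array (modules is nullable) is passed through unchanged.
    udf((ids: Seq[String]) => if (ids == null) null else keepValid(ids, valid))
  }

  // Hypothetical helper: assumes an array<string> column named "modules".
  def filterModules(df: DataFrame, validModules: Seq[String]): DataFrame =
    df.withColumn("modules", keepOnly(validModules)(col("modules")))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("module-filter")
      .getOrCreate()
    import spark.implicits._

    // Sample row modelled on the JSON in the question, with modules already flattened to ids.
    val df = Seq(
      ("12345", Seq("New", "Default", "BestValue", "Rating", "DeliveryMin", "Distance"))
    ).toDF("userId", "modules")

    filterModules(df, Seq("Default", "BestValue")).show(false)
    spark.stop()
  }
}
```

With this shape the call from the answer would become finaldf.withColumn("modules", ModuleFilter.keepOnly(Seq("Default", "BestValue"))($"modules")), and the list can come from wherever the caller keeps it.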

That should be it. I hope the answer is helpful.



Source: https://stackoverflow.com/questions/47391632/filtering-out-nested-array-entries-in-a-dataframe-json-by-passing-in-a-list-of
