PySpark “explode” dict in column

可紊 提交于 2019-11-28 11:40:19
hi-zir

The schema is incorrectly defined. You declare to be as struct with two string fields

  • item
  • recoms

while neither field is present in the document.

Unfortunately from_json can take return only structs or array of structs so redefining it as

MapType(StringType(), LongType())

is not an option.

Personally I would use an udf

from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass 

which can be applied like this

df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)

df.select("item",  explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# |    item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548|   5801144|        5|
# |31746548|   7397596|       21|
# |31746548|   5556867|        1|
# +--------+----------+---------+
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!