PySpark “explode” dict in column

[愿得一人] 2020-12-10 17:54

I have a column 'true_recoms' in a Spark dataframe:

-RECORD 17----------------------------------------------------------------- 
item        | 20380109               


        
1 answer
  • 2020-12-10 18:39

    The schema is incorrectly defined. You declare it as a struct with two string fields

    • item
    • recoms

    while neither field is present in the document.

    Unfortunately, from_json can return only structs or arrays of structs, so redefining it as

    MapType(StringType(), LongType())
    

    is not an option.

    Personally, I would use a UDF:

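    from pyspark.sql.functions import udf, explode
    import json
    
    # Declare the return type as a map so that explode() can split it into key/value rows.
    @udf("map<string, bigint>")
    def parse(s):
        try:
            return json.loads(s)
        except (json.JSONDecodeError, TypeError):
            # Malformed JSON or a null input becomes a null map.
            return None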
    

    which can be applied like this:

    df = spark.createDataFrame(
        [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
        ("item", "true_recoms")
    )
    
    df.select("item",  explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
    # +--------+----------+---------+
    # |    item|recom_item|recom_cnt|
    # +--------+----------+---------+
    # |31746548|   5801144|        5|
    # |31746548|   7397596|       21|
    # |31746548|   5556867|        1|
    # +--------+----------+---------+
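
As a rough illustration of what the UDF and explode do to each row, here is a plain-Python sketch that runs without Spark. The helper explode_rows is hypothetical and not part of PySpark; it only imitates how explode emits one row per map entry and drops rows whose map is null or empty. Treat it as a mental model under those assumptions, not as Spark's actual implementation.

```python
import json


def parse(s):
    """Stand-in for the UDF body: JSON text -> dict, or None on bad input."""
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON raises JSONDecodeError; a None input raises TypeError.
        return None


def explode_rows(item, raw):
    """Hypothetical helper imitating explode() on a map column.

    Returns one (item, key, value) tuple per map entry. A null, empty,
    or non-dict result yields no rows, as explode() skips such rows.
    """
    parsed = parse(raw)
    if not isinstance(parsed, dict):
        return []
    return [(item, key, value) for key, value in parsed.items()]


rows = explode_rows(31746548, '{"5556867":1,"5801144":5,"7397596":21}')
for row in rows:
    print(row)
```

If rows with unparseable JSON should be kept rather than dropped, explode_outer in pyspark.sql.functions is the variant to look at.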
    