I have a column 'true_recoms' in a Spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
The schema is incorrectly defined. You declare it as a struct with two string fields, item and recoms, while neither field is present in the document.
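The mismatch can be checked without Spark. The following is a minimal sketch that assumes the column values look like the sample document used later in this answer; the names doc, declared_fields and present are only illustrative.

```python
import json

# Sample value, assumed to be representative of the true_recoms column.
doc = '{"5556867":1,"5801144":5,"7397596":21}'
parsed = json.loads(doc)

# Field names from the declared struct schema.
declared_fields = ["item", "recoms"]

# Check which declared fields actually occur as keys in the document.
present = {name: name in parsed for name in declared_fields}
print(present)         # {'item': False, 'recoms': False}
print(sorted(parsed))  # ['5556867', '5801144', '7397596']
```

The keys of the document are the recommended item ids themselves, so a struct with fixed field names has nothing to match against.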
Unfortunately, from_json can return only structs or arrays of structs, so redefining it as MapType(StringType(), LongType()) is not an option.
Personally I would use a udf:
from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    # Return the JSON object as a map; malformed or null input yields null.
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        return None
which can be applied like this:
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# |    item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548|   5801144|        5|
# |31746548|   7397596|       21|
# |31746548|   5556867|        1|
# +--------+----------+---------+
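To see how the parsing step behaves on bad rows without starting Spark, the udf body can be exercised as a plain function. This is only a sketch: parse_plain is a hypothetical name, and also catching TypeError (raised by json.loads for None input) is an assumption about how null values should be handled.

```python
import json

def parse_plain(s):
    """Hypothetical Spark-free copy of the udf body, for local testing."""
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON text.
        # TypeError: None input (assumed guard for null column values).
        return None

good = parse_plain('{"5556867":1,"5801144":5,"7397596":21}')
bad = parse_plain('{"5556867":1,')   # truncated document
null = parse_plain(None)             # null column value

print(good)  # {'5556867': 1, '5801144': 5, '7397596': 21}
print(bad)   # None
print(null)  # None
```

Inside Spark a None result becomes a null map, and explode produces no output rows for a null map, so unparsable records simply drop out of the result.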