How to transform JSON strings in columns of dataframe in PySpark?


You should use the PySpark equivalents of Scala's Dataset.withColumn and the from_json standard function.

Extending on @Jacek Laskowski's post: first create the schema of the struct column, then use from_json to convert the string column to a struct, and lastly use the nested schema to extract the new columns (the f-strings below require Python 3.6+). On a struct-typed column you can use .select directly to operate on the nested structure.

from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType

# Schema of the JSON object stored as a string in column _c0
schema = StructType([StructField("object", StringType()),
                     StructField("time", StringType()),
                     StructField("values", ArrayType(FloatType()))])

# Parse the JSON string into a struct column
df = df.withColumn('_c0', f.from_json('_c0', schema))

# Keep the original columns and pull each struct field into its own column
select_list = ["_c0", "_c1"] + [f.col(f'_c0.{column}').alias(column)
                                for column in ["object", "time", "values"]]
df.select(*select_list).show()

Output (just the first two rows):

+--------------------+---+------+--------------------+--------------------+
|                 _c0|_c1|object|                time|              values|
+--------------------+---+------+--------------------+--------------------+
|[F, 2019-07-18T15...|  0|     F|2019-07-18T15:08:...|[0.22124143, 0.21...|
|[F, 2019-07-18T15...|  1|     F|2019-07-18T15:08:...|[0.22124143, 0.21...|
+--------------------+---+------+--------------------+--------------------+
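To try the snippet above end to end, here is a minimal, hypothetical reproduction; the sample JSON strings are made up, and the _c0/_c1 column names assume a headerless CSV read as in the question:

# Hypothetical sample input; the JSON values and the _c0/_c1 names are
# assumptions, not data from the original question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ('{"object": "F", "time": "2019-07-18T15:08:00", "values": [0.22124143, 0.21]}', 0),
    ('{"object": "F", "time": "2019-07-18T15:08:01", "values": [0.22124143, 0.21]}', 1),
]
df = spark.createDataFrame(data, ['_c0', '_c1'])

Note that once _c0 is a struct, df.select('_c1', '_c0.*') expands all of its fields at once, which avoids building the select list by hand.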
serv-inc offers an RDD-based alternative:

df.rdd.map applies the given function to each row of data. I have not yet used the Python variant of Spark, but it could work like this:

import json

def wrangle(row):
    # Parse the JSON string in _c0 and return the pieces alongside _c1
    tmp = json.loads(row._c0)
    return (row._c1, tmp['object'], tmp['time'], tmp['values'])

df.rdd.map(wrangle).toDF()  # should yield a new DataFrame with the object split into columns

Addressing the individual columns might work like that, but you seem to have figured that out already.

This loads the JSON-formatted string into a Python object and returns a tuple with the required elements. You may need to return a Row object instead of a tuple, but, as above, I have not yet used the Python part of Spark.
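A minimal sketch of that Row variant, assuming the same df as above (the wrangle_row name is made up for illustration):

from pyspark.sql import Row

def wrangle_row(row):
    # Same parsing as above, but returning a Row so toDF() picks up column names
    tmp = json.loads(row._c0)
    return Row(_c1=row._c1, object=tmp['object'],
               time=tmp['time'], values=tmp['values'])

df.rdd.map(wrangle_row).toDF().show()

Alternatively, keep the tuple version and pass the column names explicitly: df.rdd.map(wrangle).toDF(['_c1', 'object', 'time', 'values']).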
