Pyspark: Parse a column of json strings

Asked 2020-11-27 15:25 by 忘掉有多难 · 4 answers · 1347 views

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of JSON. I'd like to parse each row and return a new dataframe.

4 Answers
  •  [愿得一人]
    answered 2020-11-27 15:58

    For Spark 2.1+, you can use from_json, which preserves the other non-JSON columns of the dataframe, as follows:

    from pyspark.sql.functions import from_json, col

    # Infer the schema by reading the JSON strings once.
    json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
    # Replace the string column with the parsed struct. Note the assignment:
    # withColumn returns a new dataframe rather than modifying df in place.
    df = df.withColumn('json', from_json(col('json'), json_schema))
    

    This lets Spark derive the schema of the JSON string column. The df.json column is then no longer a StringType but a correctly decoded, nested StructType, and all the other columns of df are preserved as-is.

    You can access the json content as follows:

    df.select(col('json.header').alias('header'))
    
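    A minimal, self-contained sketch of the same technique, using hypothetical sample data and an explicitly declared schema rather than inference (declaring the schema up front avoids the extra pass over the data that spark.read.json needs). The column names id, header, and count are illustrative, not from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.master("local[1]").appName("json-demo").getOrCreate()

    # Hypothetical dataframe: an id column plus a column of JSON strings.
    df = spark.createDataFrame(
        [(1, '{"header": "a", "count": 10}'),
         (2, '{"header": "b", "count": 20}')],
        ["id", "json"],
    )

    # Explicit schema matching the JSON payload above.
    json_schema = StructType([
        StructField("header", StringType()),
        StructField("count", LongType()),
    ])

    # Parse the string column into a struct, then pull fields out of it.
    parsed = df.withColumn("json", from_json(col("json"), json_schema))
    parsed.select("id", col("json.header").alias("header"), col("json.count").alias("count")).show()

    Fields that fail to parse (malformed JSON or mismatched types) come back as null, so it is worth checking for null structs after parsing.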
