Pyspark: Parse a column of json strings

Asked 2020-11-27 15:25 by 忘掉有多难 · 4 answers · 1347 views

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of JSON. I'd like to parse each row and return a new dataframe.

4 Answers
  •  [愿得一人]
    answered 2020-11-27 15:58

    For Spark 2.1+, you can use from_json, which preserves the other non-JSON columns of the dataframe, as follows:

    from pyspark.sql.functions import from_json, col

    # Infer the schema by reading the JSON strings once.
    json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
    # Replace the string column with the parsed struct. Note the assignment:
    # withColumn returns a new dataframe rather than modifying df in place.
    df = df.withColumn('json', from_json(col('json'), json_schema))
    

    This lets Spark derive the schema of the JSON string column. The df.json column is then no longer a StringType but a correctly decoded, nested StructType, and all the other columns of df are preserved as-is.

    You can access the json content as follows:

    df.select(col('json.header').alias('header'))
    
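    A minimal, self-contained sketch of the same technique, using hypothetical sample data and an explicitly declared schema rather than inference (declaring the schema up front avoids the extra pass over the data that spark.read.json needs). The column names id, header, and count are illustrative, not from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.master("local[1]").appName("json-demo").getOrCreate()

    # Hypothetical dataframe: an id column plus a column of JSON strings.
    df = spark.createDataFrame(
        [(1, '{"header": "a", "count": 10}'),
         (2, '{"header": "b", "count": 20}')],
        ["id", "json"],
    )

    # Explicit schema matching the JSON payload above.
    json_schema = StructType([
        StructField("header", StringType()),
        StructField("count", LongType()),
    ])

    # Parse the string column into a struct, then pull fields out of it.
    parsed = df.withColumn("json", from_json(col("json"), json_schema))
    parsed.select("id", col("json.header").alias("header"), col("json.count").alias("count")).show()

    Fields that fail to parse (malformed JSON or mismatched types) come back as null, so it is worth checking for null structs after parsing.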
