Pyspark: Parse a column of json strings

忘掉有多难 · 2020-11-27 15:25

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of json. I'd like to parse each row and return a new dataframe.

4 Answers
  •  天命终不由人
    2020-11-27 15:45

    Converting a dataframe with json strings to a structured dataframe is actually quite simple in Spark if you convert the dataframe to an RDD of strings first (see: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)

    For example:

    >>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
    >>> new_df.printSchema()
    root
     |-- body: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- sub_json: struct (nullable = true)
     |    |    |-- id: long (nullable = true)
     |    |    |-- sub_sub_json: struct (nullable = true)
     |    |    |    |-- col1: long (nullable = true)
     |    |    |    |-- col2: string (nullable = true)
     |-- header: struct (nullable = true)
     |    |-- foo: string (nullable = true)
     |    |-- id: long (nullable = true)
    
