Spark - set null when column does not exist in dataframe

Submitted by ↘锁芯ラ on 2021-02-09 02:48:23

Question


I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A and B, and some hold A, B, C or A, C.

If I run this command

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = sqlContext.sql("SELECT A,B,C FROM table")

after loading several files I can get a "column does not exist" error, because I loaded only files that do not hold column C. How can I set this value to null instead of getting an error?


Answer 1:


The DataFrameReader.json method provides an optional schema argument you can use here. If your schema is complex, the simplest solution is to reuse the one inferred from a file which contains all the fields:

df_complete = spark.read.json("complete_file")
schema = df_complete.schema

df_with_missing = spark.read.json("df_with_missing", schema)
# or
# spark.read.schema(schema).json("df_with_missing")

If you know the schema but for some reason you cannot use the above, you have to create it from scratch.

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("A", LongType(), True), ..., StructField("C", LongType(), True)])

As always, be sure to perform some quality checks after loading your data.

Example (note that all fields are nullable):

from pyspark.sql.types import *

schema = StructType([
    StructField("x1", FloatType()),
    StructField("x2", StructType([
        StructField("y1", DoubleType()),
        StructField("y2", StructType([
            StructField("z1", StringType()),
            StructField("z2", StringType())
        ]))
    ])),
    StructField("x3", StringType()),
    StructField("x4", IntegerType())
])

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).printSchema()
## root
##  |-- x1: float (nullable = true)
##  |-- x2: struct (nullable = true)
##  |    |-- y1: double (nullable = true)
##  |    |-- y2: struct (nullable = true)
##  |    |    |-- z1: string (nullable = true)
##  |    |    |-- z2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- x4: integer (nullable = true)

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).first()
## Row(x1=None, x2=None, x3=None, x4=1)

spark.read.json(sc.parallelize(["""{"x3": "foo", "x1": 1.0}"""]), schema).first()
## Row(x1=1.0, x2=None, x3='foo', x4=None)

spark.read.json(sc.parallelize(["""{"x2": {"y2": {"z2": "bar"}}}"""]), schema).first()
## Row(x1=None, x2=Row(y1=None, y2=Row(z1=None, z2='bar')), x3=None, x4=None)

Important:

This method is applicable only to the JSON source and depends on implementation details. Don't use it for sources like Parquet.



Source: https://stackoverflow.com/questions/32166812/spark-set-null-when-column-not-exist-in-dataframe
