Question
I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A, B and some hold A, B, C or A, C..
If I run this command
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT A,B,C FROM table")
then after loading several files I get the error "column not exist", because I loaded only files that do not hold column C.
How can I set this value to null
instead of getting an error?
Answer 1:
The DataFrameReader.json method provides an optional schema argument you can use here. If your schema is complex, the simplest solution is to reuse one inferred from a file which contains all the fields:
df_complete = spark.read.json("complete_file")
schema = df_complete.schema
df_with_missing = spark.read.json("df_with_missing", schema)
# or
# spark.read.schema(schema).json("df_with_missing")
If you know the schema but for some reason you cannot use the above, you have to create it from scratch:
from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("A", LongType(), True), ..., StructField("C", LongType(), True)])
As always, be sure to perform some quality checks after loading your data.
Example (note that all fields are nullable):
from pyspark.sql.types import *
schema = StructType([
    StructField("x1", FloatType()),
    StructField("x2", StructType([
        StructField("y1", DoubleType()),
        StructField("y2", StructType([
            StructField("z1", StringType()),
            StructField("z2", StringType())
        ]))
    ])),
    StructField("x3", StringType()),
    StructField("x4", IntegerType())
])
spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).printSchema()
## root
## |-- x1: float (nullable = true)
## |-- x2: struct (nullable = true)
## | |-- y1: double (nullable = true)
## | |-- y2: struct (nullable = true)
## | | |-- z1: string (nullable = true)
## | | |-- z2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- x4: integer (nullable = true)
spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).first()
## Row(x1=None, x2=None, x3=None, x4=1)
spark.read.json(sc.parallelize(["""{"x3": "foo", "x1": 1.0}"""]), schema).first()
## Row(x1=1.0, x2=None, x3='foo', x4=None)
spark.read.json(sc.parallelize(["""{"x2": {"y2": {"z2": "bar"}}}"""]), schema).first()
## Row(x1=None, x2=Row(y1=None, y2=Row(z1=None, z2='bar')), x3=None, x4=None)
Important:
This method is applicable only to the JSON source and depends on details of the implementation. Don't use it for sources like Parquet.
Source: https://stackoverflow.com/questions/32166812/spark-set-null-when-column-not-exist-in-dataframe