Question
It seems that saving empty Parquet files is broken in Spark 2.1.0, as it is not possible to read them back in (schema inference fails on the empty output).
I found that since Spark 2.0, writing the _metadata summary file is disabled by default when writing Parquet files, but I cannot find the configuration setting to turn it back on.
I tried the following:
spark_session = SparkSession.builder \
.master(url) \
.appName(name) \
.config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
.getOrCreate()
and quite a few different combinations, for example without the spark.hadoop prefix.
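For illustration, one of those combinations might look like this (a sketch only; the exact variants tried are not listed in the question):

spark_session = SparkSession.builder \
    .master(url) \
    .appName(name) \
    .config('parquet.enable.summary-metadata', 'true') \
    .getOrCreate()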
The code I am trying to run in PySpark:
spark_session = session.get_session()
sc = spark_session.sparkContext
df = spark_session.createDataFrame(sc.emptyRDD(), schema)
df.write.mode('overwrite').parquet(path, compression='none')
# this works
df = spark_session.read.schema(schema).parquet(path)
# This throws an error
df = spark_session.read.parquet(path)
Answer 1:
It is a problem with the behavior of sc.emptyRDD(). You can find more information on why exactly this behavior occurs at https://github.com/apache/spark/pull/12855.
The current solution is to do the following:

df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)

while still keeping the config settings mentioned in the question.
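For completeness, here is a minimal end-to-end sketch of the workaround. The master URL, app name, schema, and output path are placeholder example values, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark_session = SparkSession.builder \
    .master('local[*]') \
    .appName('empty-parquet-example') \
    .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
    .getOrCreate()
sc = spark_session.sparkContext

# Placeholder schema and output path for this sketch
schema = StructType([StructField('id', StringType(), True)])
path = '/tmp/empty_parquet_example'

# sc.emptyRDD() has zero partitions, so without repartition(1) no Parquet
# part file (and hence no schema footer) gets written. repartition(1)
# forces one empty partition, producing a file that carries the schema.
df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)
df.write.mode('overwrite').parquet(path, compression='none')

# Reading back without specifying the schema now works
df = spark_session.read.parquet(path)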
Source: https://stackoverflow.com/questions/41854135/enable-metadata-files-in-spark-2-1-0