Question
It seems that saving empty Parquet files is broken in Spark 2.1.0, as it is not possible to read them back in (schema inference fails on the empty output).
I found that since Spark 2.0, writing the _metadata summary file is disabled by default when writing Parquet files, but I cannot find the configuration setting to turn it back on.
I tried the following:
spark_session = SparkSession.builder \
.master(url) \
.appName(name) \
.config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
.getOrCreate()
and quite a few different combinations, for example without the spark.hadoop prefix.
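For illustration, one of those combinations might look like this (a sketch only; the exact variants tried are not listed in the question):

spark_session = SparkSession.builder \
    .master(url) \
    .appName(name) \
    .config('parquet.enable.summary-metadata', 'true') \
    .getOrCreate()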
The code I am trying to run in PySpark:
spark_session = session.get_session()
sc = spark_session.sparkContext
df = spark_session.createDataFrame(sc.emptyRDD(), schema)
df.write.mode('overwrite').parquet(path, compression='none')
# this works
df = spark_session.read.schema(schema).parquet(path)
# This throws an error
df = spark_session.read.parquet(path)
Answer 1:
It is a problem with the behavior of sc.emptyRDD(). You can find more information on why exactly this behavior occurs at https://github.com/apache/spark/pull/12855.
The current solution is to do the following:

df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)

while still keeping the config settings mentioned in the question.
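For completeness, here is a minimal end-to-end sketch of the workaround. The master URL, app name, schema, and output path are placeholder example values, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark_session = SparkSession.builder \
    .master('local[*]') \
    .appName('empty-parquet-example') \
    .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
    .getOrCreate()
sc = spark_session.sparkContext

# Placeholder schema and output path for this sketch
schema = StructType([StructField('id', StringType(), True)])
path = '/tmp/empty_parquet_example'

# sc.emptyRDD() has zero partitions, so without repartition(1) no Parquet
# part file (and hence no schema footer) gets written. repartition(1)
# forces one empty partition, producing a file that carries the schema.
df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)
df.write.mode('overwrite').parquet(path, compression='none')

# Reading back without specifying the schema now works
df = spark_session.read.parquet(path)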
Source: https://stackoverflow.com/questions/41854135/enable-metadata-files-in-spark-2-1-0