Enable _metadata files in Spark 2.1.0

Submitted on 2020-08-07 08:17:09

Question


It seems that saving empty Parquet files is broken in Spark 2.1.0, as it is not possible to read them back in (due to faulty schema inference).

I found that since Spark 2.0, writing the _metadata file is disabled by default when writing Parquet files, but I cannot find the configuration setting to turn it back on.

I tried the following:

spark_session = SparkSession.builder \
                        .master(url) \
                        .appName(name) \
                        .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
                        .getOrCreate()

and quite a few different combinations, for example without the spark.hadoop prefix.
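One variant along those lines, shown here only as a sketch, sets the property directly on the underlying Hadoop configuration of an existing session (it reuses the url and name variables from above and relies on the private _jsc accessor in PySpark):

from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
                        .master(url) \
                        .appName(name) \
                        .getOrCreate()

# Set parquet.enable.summary-metadata on the Hadoop configuration itself,
# bypassing the spark.hadoop.* prefix used in the builder config above.
spark_session.sparkContext._jsc.hadoopConfiguration() \
    .set('parquet.enable.summary-metadata', 'true')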

The code I am trying to run in PySpark:

spark_session = session.get_session()
sc = spark_session.sparkContext

df = spark_session.createDataFrame(sc.emptyRDD(), schema)

df.write.mode('overwrite').parquet(path, compression='none')

# This works because the schema is supplied explicitly
df = spark_session.read.schema(schema).parquet(path)

# This throws an error (schema inference fails on the empty Parquet output)
df = spark_session.read.parquet(path)

Answer 1:


It is a problem with the behavior of sc.emptyRDD(). You can find more information at https://github.com/apache/spark/pull/12855 on why exactly this behavior occurs.

The current workaround is to force the empty DataFrame into a single partition, df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1), while keeping the config settings mentioned in the question.
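For completeness, a minimal end-to-end sketch of that workaround, assuming the same url, name, schema and path as in the question:

from pyspark.sql import SparkSession

# Build the session with the summary-metadata setting from the question.
spark_session = SparkSession.builder \
                        .master(url) \
                        .appName(name) \
                        .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
                        .getOrCreate()
sc = spark_session.sparkContext

# Repartition the empty DataFrame to a single partition before writing,
# so that a readable Parquet file (with a footer and schema) is produced.
df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)
df.write.mode('overwrite').parquet(path, compression='none')

# Reading back without an explicit schema should now work.
df = spark_session.read.parquet(path)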



Source: https://stackoverflow.com/questions/41854135/enable-metadata-files-in-spark-2-1-0
