Why can't Impala read parquet files after Spark SQL's write?

你。 提交于 2019-11-29 12:40:00

From Configuration section of Parquet Files in the official documentation of Apache Spark:

spark.sql.parquet.writeLegacyFormat (default: false)

If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.

Answer Given Before Official Docs Got Updated

The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was quite lately (20/Apr/17 01:59) resolved as Not A Problem.

The main point is to use spark.sql.parquet.writeLegacyFormat property and write a parquet metadata in a legacy format (which I don't see described in the official documentation under Configuration and reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal and I missed the documentation. Wouldn't it be then bugs in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So its not a new/old version thing, it was just an alternative representation that many systems didn't implement.

This SPARK-10400 can also be a quite helpful reading (about the history of spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!