Why can't Impala read parquet files after Spark SQL's write?

予麋鹿 2020-12-11 13:34

Having some issues with the way Spark interprets columns for Parquet.

I have an Oracle source with a confirmed schema (checked with the df.schema() method):

1 Answer
  • 2020-12-11 14:05

    From Configuration section of Parquet Files in the official documentation of Apache Spark:

    spark.sql.parquet.writeLegacyFormat (default: false)

    If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.
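    In practice that means enabling the property before the write. Here is a minimal sketch in Scala (the SparkSession name, the Oracle connection details and the output path are placeholders, not taken from the question):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("write-parquet-for-impala")
      // Write decimals in the legacy (fixed-length byte array) layout that Hive/Impala expect.
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()

    // Placeholder JDBC read from an Oracle source like the one in the question.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
      .option("dbtable", "SOME_SCHEMA.SOME_TABLE")
      .option("user", "user")
      .option("password", "password")
      .load()

    df.write.mode("overwrite").parquet("/tmp/parquet_for_impala")
    ```

    The property is a regular SQL conf, so it can also be toggled on an existing session with spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true") before the write.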

    Answer Given Before Official Docs Got Updated

    The very similar issue SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was resolved only recently (20/Apr/17 01:59) as Not A Problem.

    The main point is to use the spark.sql.parquet.writeLegacyFormat property so that Spark writes the Parquet metadata in the legacy format (which I don't see described in the official documentation under Configuration, and which I reported as an improvement in SPARK-20937).

    Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

    It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal - and I missed that part of the documentation. Wouldn't it then be a bug in Impala or Hive?

    The int32/int64 options were present in the original version of the decimal spec; they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So it's not a new-versus-old format issue; it was just an alternative representation that many systems didn't implement.
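    To make the difference between the representations concrete, here is a small sketch (assumptions: a running SparkSession named spark, and the expected physical types reflect my reading of Spark's writer, which picks INT32 for precision <= 9, INT64 for precision <= 18, and a fixed-length byte array otherwise when the legacy format is off):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.DecimalType

    val spark = SparkSession.builder().appName("decimal-layout-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(("12.34", "1234.56", "123456789.01")).toDF("small", "medium", "large")
      .select(
        $"small".cast(DecimalType(9, 2)).as("small"),    // fits the int32 representation
        $"medium".cast(DecimalType(18, 2)).as("medium"), // fits the int64 representation
        $"large".cast(DecimalType(38, 2)).as("large")    // always a fixed-length byte array
      )

    // Standard layout: small -> INT32, medium -> INT64, large -> FIXED_LEN_BYTE_ARRAY.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
    df.write.mode("overwrite").parquet("/tmp/decimals_standard")

    // Legacy layout: all three columns -> FIXED_LEN_BYTE_ARRAY, which older Hive/Impala can read.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.mode("overwrite").parquet("/tmp/decimals_legacy")
    ```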

    This SPARK-10400 can also be a quite helpful reading (about the history of spark.sql.parquet.writeLegacyFormat property):

    We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).
