Schema evolution in Parquet format

We currently use the Avro data format in production. Among Avro's several strengths, we know that it handles schema evolution well.

We are now evaluating the Parquet format because it is efficient when reading arbitrary subsets of columns. Before moving forward, our remaining concern is schema evolution.

Does anyone know whether schema evolution is possible in Parquet? If yes, how is it done? If no, why not?

Some resources claim that it is possible, but that columns can only be added at the end.

What does this mean?

Schema evolution can be (very) expensive.

To figure out the schema, you basically have to read all of your Parquet files and reconcile (merge) their schemas at read time, which can be expensive depending on how many files and how many columns the dataset has.

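For that reason, Spark has switched schema merging off by default since 1.5. You can always switch it back on, either per read with the mergeSchema data source option or globally with spark.sql.parquet.mergeSchema. As the Spark documentation puts it: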

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0.

Without schema merging, the reader can take the schema from a single Parquet file and assume that the remaining files share it.
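To make the cost concrete, here is a minimal sketch of what reconciling schemas at read time involves. It is not Spark's implementation: schemas are modelled as plain dicts, and the column names, type names and the merge_schemas helper are all hypothetical. The point is only that every file's schema has to be visited once before the merged result is known.

```python
from typing import Dict, List

# Simplified, hypothetical model: column name -> type name.
Schema = Dict[str, str]


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Union the columns of every file schema; fail on a type conflict."""
    merged: Schema = {}
    for schema in schemas:  # one pass over every file's schema
        for name, col_type in schema.items():
            if name in merged and merged[name] != col_type:
                raise ValueError(
                    f"incompatible types for column {name!r}: "
                    f"{merged[name]} vs {col_type}"
                )
            merged.setdefault(name, col_type)
    return merged


# Schemas of three hypothetical files written by successive releases.
file_schemas = [
    {"user_id": "int64", "last_login_date": "string"},
    {"user_id": "int64", "last_login_date": "string", "country": "string"},
    {"user_id": "int64", "registration_date": "string"},
]

merged = merge_schemas(file_schemas)
print(list(merged))
```

With merging switched off, the equivalent would be to take file_schemas[0] and never look at the others, which is cheap but misses the country and registration_date columns.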

Parquet schema evolution is implementation-dependent.

Hive, for example, has a knob, parquet.column.index.access=false, that you can set to map the schema by column name rather than by column index.

With that setting you can delete columns too, not just add them.
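The difference between the two mapping modes can be illustrated with a toy model. This is not Hive's code: read_by_index and read_by_name are hypothetical helpers, and the column names and row values are made up. It only shows why dropping a column in the middle is safe when columns are resolved by name and unsafe when they are resolved by position.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def read_by_index(table_columns: List[str], row: List[Any]) -> Row:
    """Resolve the i-th table column to the i-th value stored in the file."""
    return {
        col: (row[i] if i < len(row) else None)
        for i, col in enumerate(table_columns)
    }


def read_by_name(table_columns: List[str], file_columns: List[str],
                 row: List[Any]) -> Row:
    """Resolve each table column to the file column with the same name."""
    by_name = dict(zip(file_columns, row))
    return {col: by_name.get(col) for col in table_columns}


# A file written before last_login_date was dropped from the table.
old_file_columns = ["user_id", "last_login_date", "country"]
old_row = [42, "2016-01-01", "DE"]

# The current table schema, with the middle column removed.
table_columns = ["user_id", "country"]

by_index = read_by_index(table_columns, old_row)
by_name = read_by_name(table_columns, old_file_columns, old_row)

print(by_index)  # country picks up the old last_login_date value
print(by_name)   # country is read correctly
```

Index-based resolution silently shifts values into the wrong column, which is why an engine that only supports it can tolerate new columns at the end but nothing else.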

As I said above, it is implementation-dependent. For example, Impala would not read such Parquet tables correctly (this was fixed in the recent Impala 2.6 release) [Reference].

Apache Spark, as of version 2.0.2, still seems to support only adding columns: [Reference]

Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

PS: To get more agility on schema changes, I have seen some folks create a view on top of the actual Parquet tables that maps two (or more) different but compatible schemas to one common schema.

Let's say your new release added one new field (registration_date) and dropped another column (last_login_date). The view would then look like this:

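CREATE VIEW datamart.unified_fact_vw
AS
-- schema1 is the old layout: it has last_login_date but no registration_date
SELECT f1..., NULL AS registration_date
FROM datamart.unified_fact_schema1 f1
UNION ALL
-- schema2 is the new layout: it has registration_date but no last_login_date
-- UNION ALL matches columns by position, so keep both select lists in the same order
SELECT f2..., NULL AS last_login_date
FROM datamart.unified_fact_schema2 f2
;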

You get the idea. The nice thing is that it works the same across all SQL-on-Hadoop dialects (Hive, Impala and Spark, as mentioned above) and still keeps all the benefits of Parquet tables (columnar storage, predicate push-down, etc.).
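For a runnable version of the same idea, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive, Impala or Spark. The user_id and amount columns and the sample rows are assumptions made up for the example, and the datamart prefix is dropped because SQLite has no such schema here. Both select lists name their columns in the same order, since UNION ALL lines columns up by position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Old layout: has last_login_date, no registration_date.
CREATE TABLE unified_fact_schema1 (
    user_id INTEGER, amount REAL, last_login_date TEXT);
-- New layout: has registration_date, no last_login_date.
CREATE TABLE unified_fact_schema2 (
    user_id INTEGER, amount REAL, registration_date TEXT);

INSERT INTO unified_fact_schema1 VALUES (1, 9.5, '2016-05-01');
INSERT INTO unified_fact_schema2 VALUES (2, 3.0, '2016-06-07');

-- Pad each side with NULL for the column it lacks, in the same position.
CREATE VIEW unified_fact_vw AS
SELECT user_id, amount, last_login_date, NULL AS registration_date
FROM unified_fact_schema1
UNION ALL
SELECT user_id, amount, NULL AS last_login_date, registration_date
FROM unified_fact_schema2;
""")

rows = conn.execute(
    "SELECT user_id, amount, last_login_date, registration_date "
    "FROM unified_fact_vw ORDER BY user_id"
).fetchall()

for row in rows:
    print(row)
```

Consumers query only unified_fact_vw, so a later schema change means adding one more branch to the view rather than rewriting existing files.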

In addition to the answer above, another option is to set

"spark.hadoop.parquet.enable.summary-metadata" to "true"

This creates summary files containing the schema whenever you write data. After a save you will see the summary files '_metadata' and '_common_metadata'. _common_metadata holds the compressed schema, which is read every time you read the Parquet data. That makes reads very fast, because the schema is already available. Spark looks for these schema files and, if they are present, takes the schema from them.

Note that this makes writes very slow, as Spark has to merge the schemas of all files and create these schema files.

We had a similar situation where the Parquet schema changed. We set the above config to true for some time after the schema change so that the schema files were generated, and then set it back to false. We had to accept slow writes for a while, but once the schema files had been generated, setting it back to false served the purpose, with the bonus of faster reads.

Source: https://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format

Tags: apache-spark, Hadoop, data-warehouse, avro, parquet