Schema evolution in Parquet format

我寻月下人不归 2020-12-07 18:58

Currently we are using the Avro data format in production. Among Avro's several good points, we know that it handles schema evolution well.

Now we are evaluating the Parquet format, and we would like to know whether Parquet supports schema evolution as well.

2 Answers
  •  余生分开走
    2020-12-07 19:37

    In addition to the answer above, another option is to set

    "spark.hadoop.parquet.enable.summary-metadata" to "true"
    

    What it does: when you write files, Spark creates summary files containing the schema. After saving you will see files with the '_metadata' and '_common_metadata' suffixes. The _common_metadata file holds the consolidated schema and is read every time you read the Parquet data, which makes reads fast because the schema is already available: if these summary files are present, Spark gets the schema from them instead of reading individual file footers.
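    For concreteness, here is a minimal Scala sketch of enabling the flag when building the session and writing a dataset; the app name and the /tmp/events path are hypothetical, made up for illustration:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("parquet-summary-metadata")  // hypothetical app name
          // Enable generation of _metadata / _common_metadata on write.
          .config("spark.hadoop.parquet.enable.summary-metadata", "true")
          .getOrCreate()

        val df = spark.range(100).toDF("value")
        df.write.mode("overwrite").parquet("/tmp/events")  // hypothetical path
        // The output directory now contains _metadata and _common_metadata
        // next to the part-*.parquet files, and subsequent reads of
        // /tmp/events pick up the schema from _common_metadata.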

    Note that this makes writes very slow, as Spark has to merge the schemas of all files to create these summary files.

    We had a similar situation where the Parquet schema changed. What we did was set the above config to true for some time after the schema change, so that the schema files were generated, and then set it back to false. We had to accept slow writes for a while, but once the schema files existed, setting it back to false served the purpose, with the added bonus of faster reads.
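    Assuming the session from the sketch above, that toggle can also be done at runtime through the Hadoop configuration rather than by rebuilding the session; dfWithNewSchema is a placeholder for the data written after the schema change, and note that the Hadoop-level key drops the "spark.hadoop." prefix:

        // Turn summary files on only around the writes that follow the
        // schema change...
        spark.sparkContext.hadoopConfiguration
          .set("parquet.enable.summary-metadata", "true")

        dfWithNewSchema.write.mode("append").parquet("/tmp/events")

        // ...then switch them off again to restore normal write speed.
        // Reads keep using the already-generated _common_metadata.
        spark.sparkContext.hadoopConfiguration
          .set("parquet.enable.summary-metadata", "false")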
