Spark read Parquet files of different versions


Question


I have Parquet files generated over the course of a year with a Version1 schema. After a recent schema change, the newer Parquet files use a Version2 schema with extra columns.

So when I load old-version and new-version Parquet files together and try to filter on one of the changed columns, I get an exception.

I would like Spark to read both the old and new files and fill in null values wherever a column is not present. Is there a workaround that makes Spark fill in nulls when a column is not found?


Answer 1:


There are two approaches you can try.

1. You can use a map transformation, though this is not recommended (see the sketch below), for example: spark.read.parquet("mypath").map(e => if (e.isNullAt(e.fieldIndex("field"))) null else e.getAs[String]("field"))

2. The better way is to use the mergeSchema option, for example:

spark.read.option("mergeSchema", "true").parquet(xxx).as[MyClass]

Ref: schema-merging
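
For reference, here is a minimal, self-contained sketch of the first workaround. The path "mypath" and the column name "field" are placeholders carried over from the answer, not real names:

import org.apache.spark.sql.SparkSession

object NullSafeRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NullSafeRead").getOrCreate()
    import spark.implicits._

    val values = spark.read.parquet("mypath").map { e =>
      // Guard both cases: the row's schema may lack "field" entirely,
      // or the column may exist but hold a null for this row.
      if (!e.schema.fieldNames.contains("field") || e.isNullAt(e.fieldIndex("field"))) null
      else e.getAs[String]("field")
    }
    values.show()
  }
}

Note that without mergeSchema, Spark infers one schema for the whole read from a sampled file footer, so if that file lacks the column the read still fails; that is part of why the second option is preferable.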




Answer 2:


Assuming you have a set of files you intend to read:

  1. Query the schema of each file in the set, producing N groups of files, where each group contains files with the same schema.
  2. Operate on each group of files using a filter compatible with that group's schema.
  3. Union the results of filtering/operating on each group (assuming your output has the same shape for every file schema); a sketch follows below.
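
A hedged sketch of these three steps, assuming the files sit under one directory, only one column ("newColumn", a hypothetical name) differs between versions, and the Hadoop FileSystem API is available:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ReadBySchemaGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadBySchemaGroup").getOrCreate()
    val dir = new Path("/data/events") // placeholder path

    // Step 1: group files by schema (reading one file's schema only touches its footer).
    val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val files = fs.listStatus(dir).map(_.getPath.toString).filter(_.endsWith(".parquet"))
    val groups = files.groupBy(f => spark.read.parquet(f).schema)

    // Step 2: read each group, padding the missing column with nulls so schemas line up.
    val frames: Iterable[DataFrame] = groups.map { case (schema, paths) =>
      val df = spark.read.parquet(paths: _*)
      if (schema.fieldNames.contains("newColumn")) df
      else df.withColumn("newColumn", lit(null).cast("string"))
    }

    // Step 3: union the per-group results by column name and operate on the whole.
    val all = frames.reduce(_ unionByName _)
    all.filter(col("newColumn").isNotNull).show()
  }
}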



Answer 3:


Spark SQL itself supports schema merging for Parquet files. You can read all about it in the official documentation:

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or

  2. setting the global SQL option spark.sql.parquet.mergeSchema to true.
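
A small end-to-end sketch, adapted from the example in that documentation page (the path data/test_table and the column names are illustrative):

import org.apache.spark.sql.SparkSession

object MergeSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MergeSchemaExample").getOrCreate()
    import spark.implicits._

    // Write a "Version1" table: columns (value, square).
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Write a "Version2" table with a different second column: (value, cube).
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // mergeSchema unions the two schemas; rows from files that lack a column get null there.
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()
    // root
    //  |-- value: integer (nullable = true)
    //  |-- square: integer (nullable = true)
    //  |-- cube: integer (nullable = true)
    //  |-- key: integer (nullable = true)

    // Filtering on the newer column now works; old-version rows simply hold null.
    mergedDF.filter($"cube".isNull).show()
  }
}

This addresses the original question directly: after the merged read, filtering on a Version2-only column returns nulls for Version1 rows instead of throwing an exception.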



Source: https://stackoverflow.com/questions/43668666/spark-read-parquet-files-of-different-versions
