Question
We have generated a Parquet file in Python with Dask and in R with Drill (using the sergeant package). We have noticed a few issues:
- The Parquet dataset generated by Dask (i.e. fastparquet) has _metadata and _common_metadata files, while the Parquet output from R / Drill does not have these files and has .crc files instead (which can be deleted). What is the difference between these Parquet implementations?
Answer 1:
(Only answering question 1; please post separate questions to make them easier to answer.)
_metadata and _common_metadata are helper files that are not required for a Parquet dataset; they are used by Spark/Dask/Hive/... to infer the metadata of all Parquet files in a dataset without having to read the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains all footers of all Parquet files: only the first query on a dataset reads every file, and subsequent queries read only the file that caches all the footers.
Tools that use _metadata and _common_metadata should leverage them for faster execution, but should not depend on them to operate. When these files are absent, the query engine simply reads the footers of all files.
Source: https://stackoverflow.com/questions/45415829/generating-parquet-files-differences-between-r-and-python