How to efficiently read multiple small parquet files with Spark? Is there a CombineParquetInputFormat?
Question: Spark generated multiple small parquet files. How can one efficiently handle these small parquet files, both in the producer and in the consumer Spark jobs?

Answer 1: The most straightforward approach, IMHO, is to use repartition/coalesce (prefer coalesce, unless the data is skewed and you want same-sized outputs) before writing the parquet files, so that you do not create small files in the first place.

df
  .map(<some transformation>)
  .filter(<some filter>)
  // ...
  .coalesce(<number of partitions>)
  .write
  .parquet(<output path>)
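As a concrete illustration of that advice, here is a minimal, self-contained Scala sketch of the producer side. The paths and the target partition count of 8 are hypothetical placeholders, not values from the question; the point is only where coalesce (or repartition) sits in the chain relative to write.

import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactParquetOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-parquet-output")
      .getOrCreate()

    // Hypothetical paths -- substitute your own.
    val inputPath  = "/data/events/raw"
    val outputPath = "/data/events/compacted"

    val df = spark.read.parquet(inputPath)

    // coalesce(n) merges existing partitions without a shuffle, which is cheap,
    // but it can leave uneven file sizes if the data is skewed.
    df.coalesce(8)
      .write
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)

    // repartition(n) does a full shuffle and produces roughly equal-sized files;
    // use it instead of coalesce when the data is skewed.
    // df.repartition(8).write.mode(SaveMode.Overwrite).parquet(outputPath)

    spark.stop()
  }
}

The trade-off is the one the answer hints at: coalesce avoids a shuffle, so it is usually the cheaper choice, while repartition pays for a shuffle in exchange for evenly sized output files.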