I have a Parquet file /df saved in HDFS with 120 partitions; each partition is around 43.5 M on HDFS.
Total size:
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions. However, Spark will automatically load the file into 60 partitions.
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
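To check how the 60 partitions relate to the 120 files on HDFS, one option is to group by spark_partition_id() and input_file_name(); a minimal sketch, assuming PySpark 2.x or later where both functions are available:
from pyspark.sql.functions import spark_partition_id, input_file_name

# Map each Spark partition to the distinct HDFS files it reads from.
files_per_partition = (
    df.select(spark_partition_id().alias("partition"),
              input_file_name().alias("file"))
      .distinct()
      .groupBy("partition")
      .count()             # number of input files packed into each partition
      .orderBy("partition")
)
files_per_partition.show(5)
With 120 files of ~43.5 M packed into 60 partitions, this would be expected to show two files per partition.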
HDFS settings:
'parquet.block.size' is not set:
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns nothing.
'dfs.blocksize' is set to 128 M:
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
Changing either of those values to something lower does not make Spark load the Parquet file with the same number of partitions as exist on HDFS.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
I realize 43.5 M is well below 128 M. However, for this application I am going to immediately apply many transformations that will bring each of the 120 partitions much closer to 128 M. I am trying to save myself from having to repartition in the application immediately after loading.
Is there a way to force Spark to load the Parquet file with the same number of partitions that are stored on HDFS?
First, I'd start by checking how Spark splits the data into partitions. By default this depends on the nature and size of your data and your cluster. This article should provide you with the answer to why your data frame was loaded into 60 partitions:
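For what it's worth, 60 is roughly what the file-packing arithmetic predicts; a rough model, assuming the usual defaults of 128 M for spark.sql.files.maxPartitionBytes and 4 M for spark.sql.files.openCostInBytes:
# Greedily pack ~43.5 M files into a read partition until the 128 M cap is hit.
file_size = 43.5 * 2**20
open_cost = 4 * 2**20            # per-file padding added while packing
max_split = 128 * 2**20          # cap on bytes per read partition

packed, size = 0, 0.0
while size + file_size <= max_split:
    size += file_size + open_cost
    packed += 1
print(packed)                    # -> 2 files per partition
print(120 // packed)             # -> 60 partitions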
In general, it's Catalyst that takes care of all the optimization (including the number of partitions), so unless there is a really good reason for custom settings, I'd let it do its job. If any of the transformations you use are wide, Spark will shuffle the data anyway.
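If the partition count really has to be 120 right after the load, the straightforward fallback is an explicit repartition, and the post-shuffle count of wide transformations is governed by spark.sql.shuffle.partitions; a minimal sketch, with the values as assumptions:
spark.conf.set("spark.sql.shuffle.partitions", 120)   # partitions produced by wide transformations

df = spark.read.parquet('/df')
df = df.repartition(120)          # full shuffle to exactly 120 partitions
print(df.rdd.getNumPartitions())  # -> 120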
Source: https://stackoverflow.com/questions/56602051/load-parquet-file-and-keep-same-number-hdfs-partitions