What controls the number of partitions when reading Parquet files?




My setup:

Two Spark clusters. One on EC2 and one on Amazon EMR. Both with Spark 1.3.1.

The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.

The code:

Read a folder containing 12 Parquet files and count the number of partitions

// Load every Parquet file under the folder, then count the resulting partitions
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length

Observations:

  • On EC2 this code gives me 12 partitions (one per file, makes sense).
  • On EMR this code gives me 138 (!) partitions.

Question:

What controls the number of partitions when reading Parquet files?

I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
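For background, Hadoop-style input formats typically derive splits from the block size that the underlying FileSystem reports for each file, so the same files can yield different partition counts under different FileSystem implementations. A minimal sketch of that arithmetic in Scala (hypothetical sizes; the real split logic also honors min/max split settings and Parquet row-group boundaries):

// Rough model of block-based splitting, not Spark's actual code.
// A FileSystem that reports a smaller block size produces more splits,
// and hence more partitions, for the very same files.
def expectedPartitions(fileSizesBytes: Seq[Long], blockSizeBytes: Long): Long =
  fileSizesBytes.map(size => math.ceil(size.toDouble / blockSizeBytes).toLong).sum

val fileSizes = Seq.fill(12)(64L * 1024 * 1024)  // hypothetical: 12 files of 64 MB each

expectedPartitions(fileSizes, 64L * 1024 * 1024) // 12 partitions (one per file)
expectedPartitions(fileSizes, 8L * 1024 * 1024)  // 96 partitions (8 per file)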

Insights would be greatly appreciated.

Thanks.

UPDATE:

It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).

When removing

<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>

from core-site.xml (thereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
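For reference, the stock Hadoop mapping that takes over once the override is removed would look roughly like this (assuming Hadoop's native S3 client, which is the default for the s3n:// scheme):

<property><name>fs.s3n.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>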

When running with EmrFileSystem, it seems that the number of partitions can be controlled with:

<property><name>fs.s3n.block.size</name><value>xxx</value></property>
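If editing core-site.xml is not convenient, the same property can apparently also be set per application through the SparkContext's Hadoop configuration before reading (a sketch; the 128 MB value is an arbitrary example, not a recommended default):

// Override the reported S3 block size (in bytes) for this job only.
sc.hadoopConfiguration.set("fs.s3n.block.size", (128L * 1024 * 1024).toString)

val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length // a larger block size should mean fewer, larger splits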

Is there a cleaner way of controlling the number of partitions when using EmrFileSystem?

Source: https://stackoverflow.com/questions/30168280/what-controls-the-number-of-partitions-when-reading-parquet-files
