Question
My setup:
Two Spark clusters, one on EC2 and one on Amazon EMR, both running Spark 1.3.1.
The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.
The code:
Read a folder containing 12 Parquet files and count the number of partitions:
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length
Observations:
- On EC2 this code gives me 12 partitions (one per file, makes sense).
- On EMR this code gives me 138 (!) partitions.
Question:
What controls the number of partitions when reading Parquet files?
I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
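(For what it's worth, one quick way to compare the two clusters is to dump the relevant Hadoop settings from the Spark shell. This is just a diagnostic sketch; fs.s3n.impl and fs.s3n.block.size are standard Hadoop configuration keys.)
// Diagnostic sketch: print the S3 filesystem settings Hadoop reports on this
// cluster (run in spark-shell, where sc is the SparkContext). A null value
// means the key is not set explicitly.
val hc = sc.hadoopConfiguration
println("fs.s3n.impl       = " + hc.get("fs.s3n.impl"))
println("fs.s3n.block.size = " + hc.get("fs.s3n.block.size"))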
Insights would be greatly appreciated.
Thanks.
UPDATE:
It seems that the many partitions are created by EMR's S3 filesystem implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).
When removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>
from core-site.xml (thereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
When running with EmrFileSystem, it seems that the number of partitions can be controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property>
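If editing core-site.xml is not desirable, the same property can apparently be set on the SparkContext's Hadoop configuration before the read. A minimal sketch, assuming that setting (128 MB below is an arbitrary example value):
// Sketch: set the S3 block size programmatically instead of via core-site.xml.
// 134217728 bytes = 128 MB; a larger block size should yield fewer splits,
// and therefore fewer partitions, when EmrFileSystem computes them.
sc.hadoopConfiguration.set("fs.s3n.block.size", "134217728")
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length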
Could there be a cleaner way of controlling the number of partitions when using EmrFileSystem?
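A blunt alternative that sidesteps the filesystem's split computation entirely is to repartition the DataFrame after reading (repartition is available on DataFrame as of Spark 1.3, at the cost of a shuffle):
// Workaround sketch: collapse the input partitions to a chosen count after
// the read. repartition() performs a full shuffle, so it is not free.
val logs = sqlContext.parquetFile("s3n://mylogs/").repartition(12)
logs.rdd.partitions.length  // 12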
Source: https://stackoverflow.com/questions/30168280/what-controls-the-number-of-partitions-when-reading-parquet-files