Question
I wanted to ask: is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark? They both work on HDFS (TextInputFormat), so it should be the same in theory.
Are there any cases where the procedure of data partitioning can differ? Any insights would be very helpful to my study.
Thanks
Answer 1:
Is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark?
Spark supports all Hadoop I/O formats, since it uses the same Hadoop InputFormat APIs alongside its own formatters. So, by default, Spark input partitions work the same way as Hadoop/MapReduce input splits. The amount of data in a partition is configurable at run time, and transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed.
Are there any cases where their procedure of data partitioning can differ?
Apart from the Hadoop I/O APIs, Spark also has some other intelligent I/O formats (e.g. the Databricks CSV reader and NoSQL DB connectors) which directly return a Dataset/DataFrame (higher-level abstractions on top of the RDD) and which are Spark specific.
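A small sketch of what such a connector-style read looks like (the CSV source, originally the Databricks spark-csv package, is built into Spark 2.x+); the path and options are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("connector-demo").master("local[*]").getOrCreate()

// A connector-style source returns a DataFrame/Dataset directly, and the
// connector itself (not TextInputFormat) decides how the input is partitioned.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/data.csv")   // placeholder location

println(s"partitions chosen by the source: ${df.rdd.getNumPartitions}")
```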
Key points on Spark partitions when reading data from non-Hadoop sources:
- The maximum size of a partition is ultimately determined by the connectors:
  - for S3, the property is fs.s3n.block.size or fs.s3.block.size;
  - the Cassandra property is spark.cassandra.input.split.size_in_mb;
  - the Mongo property is spark.mongodb.input.partitionerOptions.partitionSizeMB.
- By default, the number of partitions is max(sc.defaultParallelism, total_data_size / data_block_size). Sometimes the number of available cores in the cluster also influences the number of partitions, e.g. sc.parallelize() without a partitions parameter (see the sketch after this list).
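To tie these points together, here is a hedged Scala sketch: the property names are the ones quoted in this answer, while the sizes, master setting, and data are illustrative assumptions (check your connector's documentation for the exact keys and defaults in your version).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Connector-level split/partition size settings (names as quoted above).
val conf = new SparkConf()
  .setAppName("partition-size-demo")
  .setMaster("local[4]")                                               // 4 cores => defaultParallelism = 4
  .set("spark.cassandra.input.split.size_in_mb", "64")                 // Cassandra connector
  .set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64") // MongoDB connector

val sc = new SparkContext(conf)

// The S3 block size is a Hadoop-level property, set on the Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.block.size", (64 * 1024 * 1024).toString)

// Without an explicit numSlices argument, parallelize falls back to
// sc.defaultParallelism, which is derived from the available cores.
val byDefault = sc.parallelize(1 to 1000)
println(byDefault.getNumPartitions)        // typically 4 here

// With an explicit partition count, that value is used instead.
val explicit = sc.parallelize(1 to 1000, 10)
println(explicit.getNumPartitions)         // 10
```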
Read more: link1
Source: https://stackoverflow.com/questions/39651842/difference-between-mapreduce-split-and-spark-paritition