Hive partitions, Spark partitions and joins in Spark - how they relate

Submitted by 主宰稳场 on 2019-12-05 13:09:09

then how many partitions are the resultant datasets going to have respectively? Partitions equal to the number of objects in S3?

Impossible to answer given the information you've provided. In recent versions the number of partitions depends primarily on spark.sql.files.maxPartitionBytes, although other factors can play some role as well.
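For illustration, a minimal sketch (the S3 path and table layout are hypothetical) showing that the input-partition count is driven by configuration and input size rather than simply by the number of objects:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-count-sketch")
  // Upper bound on the bytes packed into one input partition when reading
  // file-based sources (default is 128 MB).
  .config("spark.sql.files.maxPartitionBytes", "64MB")
  .getOrCreate()

// Hypothetical location of a Hive-partitioned Parquet table on S3.
val df = spark.read.parquet("s3a://my-bucket/warehouse/my_table/")

// The resulting count reflects total input size, maxPartitionBytes,
// spark.sql.files.openCostInBytes and the default parallelism,
// not simply the number of S3 objects.
println(df.rdd.getNumPartitions)
```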

Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join?

Not as of today (Spark 2.3.0); however, Spark can utilize bucketing (DISTRIBUTE BY) to optimize joins. See How to define partitioning of DataFrame?. This might change in the future, once the Data Source API v2 stabilizes.
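As a rough sketch of what that looks like in practice (the table names and data here are made up), writing both sides bucketed and sorted on the join key with the same number of buckets lets the planner avoid shuffling either side of the sort-merge join:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the real Hive tables.
val orders    = Seq((1L, "o1"), (2L, "o2")).toDF("customer_id", "order_id")
val customers = Seq((1L, "Alice"), (2L, "Bob")).toDF("customer_id", "name")

// Persist both sides bucketed (and sorted) on the join key,
// using the same number of buckets.
orders.write.bucketBy(16, "customer_id").sortBy("customer_id")
  .saveAsTable("orders_bucketed")
customers.write.bucketBy(16, "customer_id").sortBy("customer_id")
  .saveAsTable("customers_bucketed")

// With matching bucketing on the join key, the plan should show no
// Exchange under the SortMergeJoin.
spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
  .explain()
```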

Suppose now that I am using RDDs instead (...) Again, is Spark going to be able to utilise the fact that the partition key in the Hive tables is being used in the join?

Not at all. Even if the data is bucketed, RDD transformations and functional Dataset transformations are black boxes to the optimizer, so no optimizations can be, or are, applied here.
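A brief sketch of why, continuing with the hypothetical bucketed tables from the previous example: once you drop to the RDD API, the join key lives inside opaque closures, so Spark plans a shuffle of both sides regardless of how the underlying Hive tables are partitioned or bucketed:

```scala
// Convert the (hypothetical) tables to pair RDDs keyed by the join column.
val left = spark.table("orders_bucketed").rdd
  .map(r => (r.getAs[Long]("customer_id"), r))
val right = spark.table("customers_bucketed").rdd
  .map(r => (r.getAs[Long]("customer_id"), r))

// Both sides get shuffled here; the Hive partitioning/bucketing metadata
// is invisible at this level.
val joined = left.join(right)
println(joined.toDebugString)  // the lineage shows the shuffle stages
```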
