Question
I've got an ETL-like scenario, in which I read data from multiple JDBC tables and files and perform some aggregations and joins between the sources.
In one step I must join two JDBC tables. I've tried to do something like:
val df1 = spark.read.format("jdbc")
.option("url", Database.DB_URL)
.option("user", Database.DB_USER)
.option("password", Database.DB_PASSWORD)
.option("dbtable", tableName)
.option("driver", Database.DB_DRIVER)
.option("upperBound", data.upperBound)
.option("lowerBound", data.lowerBound)
.option("numPartitions", data.numPartitions)
.option("partitionColumn", data.partitionColumn)
.load();
val df2 = spark.read.format("jdbc")
.option("url", Database.DB_URL)
.option("user", Database.DB_USER)
.option("password", Database.DB_PASSWORD)
.option("dbtable", tableName)
.option("driver", Database.DB_DRIVER)
.option("upperBound", data2.upperBound)
.option("lowerBound", data2.lowerBound)
.option("numPartitions", data2.numPartitions)
.option("partitionColumn", data2.partitionColumn)
.load();
df1.join(df2, Seq("partition_key", "id")).show();
Note that partitionColumn in both cases is the same: "partition_key".
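For context, the data and data2 objects above are not shown in the question; a hypothetical sketch of what they might carry (bounds and partition count chosen to match the mod(s.id, 23) / numPartitions=23 visible in the plan below) could look like this:
// Hypothetical sketch only: the question does not show how data/data2 are built.
// Both use the same partitionColumn, so both reads end up with the same number
// of range-based JDBC partitions.
case class JdbcPartitioning(
  partitionColumn: String, // column Spark ranges over, here "partition_key"
  lowerBound: Long,        // lower bound of partitionColumn values
  upperBound: Long,        // upper bound of partitionColumn values
  numPartitions: Int)      // number of parallel JDBC partitions/queries
val data  = JdbcPartitioning("partition_key", 0L, 23L, 23)
val data2 = JdbcPartitioning("partition_key", 0L, 23L, 23)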
However, when I run such a query, I can see an unnecessary exchange (plan cleaned up for readability):
df1.join(df2, Seq("partition_key", "id")).explain(extended = true);
Project [many many fields]
+- Project [partition_key#10090L, iv_id#10091L, last_update_timestamp#10114, ... more fields]
+- SortMergeJoin [partition_key#10090L, id#10091L], [partition_key#10172L, id#10179L], Inner
:- *Sort [partition_key#10090L ASC NULLS FIRST, iv_id#10091L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(partition_key#10090L, iv_id#10091L, 4)
: +- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab2 s)) [numPartitions=23] [partition_key#10090L,id#10091L,last_update_timestamp#10114] PushedFilters: [*IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint,last_update_timestamp:timestamp>
+- *Sort [partition_key#10172L ASC NULLS FIRST, id#10179L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partition_key#10172L, iv_id#10179L, 4)
+- *Project [partition_key#10172L, id#10179L ... 75 more fields]
+- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab1 s)) [numPartitions=23] [fields] PushedFilters: [*IsNotNull(ID), *IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint...
If the reads are already partitioned with numPartitions and the other options, and the partition count is the same on both sides, why is another Exchange needed? Can we somehow avoid this unnecessary shuffle? On the test data I can see Spark sending more than 150M of data during this Exchange, and the production Datasets are much bigger, so it can become a serious bottleneck.
Answer 1:
With the current implementation of the Data Source API, no partitioning information is passed upstream, so even if the data could be joined without a shuffle, Spark cannot use this information. Therefore your assumption that:
JdbcRelation uses RangePartitioning on reading
is just incorrect. Furthermore, it looks like Spark uses the same internal code to handle range-based and predicate-based JDBC partitions. While the former could be translated to a SortOrder, the latter might be incompatible with Spark SQL in general.
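To make the distinction concrete, here is a minimal sketch of the two JDBC partitioning styles the paragraph above refers to (the URL/credentials reuse the Database constants from the question; the table name and predicate expressions are illustrative assumptions). In neither case does the resulting DataFrame carry partitioning information the optimizer could use to drop the Exchange:
// Range-based partitioning: Spark derives numPartitions WHERE clauses over
// partitionColumn between lowerBound and upperBound.
val rangePartitioned = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("driver", Database.DB_DRIVER)
  .option("dbtable", "tab1")
  .option("partitionColumn", "partition_key")
  .option("lowerBound", "0")
  .option("upperBound", "23")
  .option("numPartitions", "23")
  .load()
// Predicate-based partitioning: one partition per explicit predicate string.
val props = new java.util.Properties()
props.setProperty("user", Database.DB_USER)
props.setProperty("password", Database.DB_PASSWORD)
props.setProperty("driver", Database.DB_DRIVER)
val predicates = (0 until 23).map(k => s"mod(id, 23) = $k").toArray
val predicatePartitioned = spark.read.jdbc(Database.DB_URL, "tab1", predicates, props)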
When in doubt, it is possible to retrieve the Partitioner information using QueryExecution and the internal RDD:
df.queryExecution.toRdd.partitioner
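For example, applied to df1 and df2 from the question, a plain JDBC scan exposes no Partitioner at all, which is consistent with the explanation above (a quick check added here, not part of the original answer):
// No Partitioner on the JDBC-backed RDDs, so Spark has no proof that the two
// sides are co-partitioned and must insert an Exchange before the SortMergeJoin.
println(df1.queryExecution.toRdd.partitioner) // expected: None
println(df2.queryExecution.toRdd.partitioner) // expected: None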
This might change in the future (SPIP: Data Source API V2, SPARK-15689 - Data source API v2, and Spark Data Frame. PreSorded partitions).
Source: https://stackoverflow.com/questions/47597970/how-to-join-two-jdbc-tables-and-avoid-exchange