How to join two JDBC tables and avoid Exchange?

Posted by 你说的曾经没有我的故事 on 2021-02-09 03:01:10

Question


I've got an ETL-like scenario, in which I read data from multiple JDBC tables and files and perform some aggregations and joins between sources.

In one step I must join two JDBC tables. I've tried something like:

val df1 = spark.read.format("jdbc")
            .option("url", Database.DB_URL)
            .option("user", Database.DB_USER)
            .option("password", Database.DB_PASSWORD)
            .option("dbtable", tableName)
            .option("driver", Database.DB_DRIVER)
            .option("upperBound", data.upperBound)
            .option("lowerBound", data.lowerBound)
            .option("numPartitions", data.numPartitions)
            .option("partitionColumn", data.partitionColumn)
            .load();

val df2 = spark.read.format("jdbc")
            .option("url", Database.DB_URL)
            .option("user", Database.DB_USER)
            .option("password", Database.DB_PASSWORD)
            .option("dbtable", tableName)
            .option("driver", Database.DB_DRIVER)
            .option("upperBound", data2.upperBound)
            .option("lowerBound", data2.lowerBound)
            .option("numPartitions", data2.numPartitions)
            .option("partitionColumn", data2.partitionColumn)
            .load();

df1.join(df2, Seq("partition_key", "id")).show();

Note that partitionColumn is the same in both cases - "partition_key".
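For context, these options only tell the JDBC reader how to split the scan into per-partition range predicates; they carry no Partitioner that the planner could later reuse. A simplified, hypothetical sketch of that stride logic (an approximation for illustration, not Spark's exact JDBCRelation.columnPartition implementation):

```scala
// Simplified sketch: how lowerBound/upperBound/numPartitions become
// per-partition WHERE clauses for the JDBC scan. The first partition
// also picks up NULLs; the last is open-ended.
def rangePredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = upper / numPartitions - lower / numPartitions
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    val end   = start + stride
    if (i == 0)                      s"$column < $end OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $start"
    else                             s"$column >= $start AND $column < $end"
  }
}

rangePredicates("partition_key", 0L, 23L, 4).foreach(println)
```

The key point is that each predicate just drives one parallel SELECT; nothing here promises Spark that equal partition_key values land in the same task.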

However, when I run such a query, I can see an unnecessary Exchange (plan trimmed for readability):

df1.join(df2, Seq("partition_key", "id")).explain(extended = true);
Project [many many fields]
+- Project [partition_key#10090L, iv_id#10091L, last_update_timestamp#10114,  ... more fields]
    +- SortMergeJoin [partition_key#10090L, id#10091L], [partition_key#10172L, id#10179L], Inner
       :- *Sort [partition_key#10090L ASC NULLS FIRST, iv_id#10091L ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(partition_key#10090L, iv_id#10091L, 4)
       :     +- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab2 s)) [numPartitions=23] [partition_key#10090L,id#10091L,last_update_timestamp#10114] PushedFilters: [*IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint,last_update_timestamp:timestamp>
       +- *Sort [partition_key#10172L ASC NULLS FIRST, id#10179L ASC NULLS FIRST], false, 0
          +- Exchange hashpartitioning(partition_key#10172L, iv_id#10179L, 4)
             +- *Project [partition_key#10172L, id#10179L ... 75 more fields]
               +- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab1 s)) [numPartitions=23] [fields] PushedFilters: [*IsNotNull(ID), *IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint...

If the reads are already partitioned with numPartitions and the other options, and the partition count is the same on both sides, why is another Exchange needed? Can we somehow avoid this unnecessary shuffle? On test data I can see that Spark sends more than 150 MB of data during this Exchange, and the production datasets are much bigger, so this can become a serious bottleneck.


Answer 1:


With the current implementation of the Data Source API there is no partitioning information passed upstream, so even if the data could be joined without a shuffle, Spark cannot use this information. Therefore your assumption that:

JdbcRelation uses RangePartitioning on reading

is just incorrect. Furthermore, it looks like Spark uses the same internal code to handle range-based JDBC partitions and predicate-based JDBC partitions. While the former could be translated into SortOrder, the latter might be incompatible with Spark SQL in general.

When in doubt, it is possible to retrieve Partitioner information using QueryExecution and internal RDD:

df.queryExecution.toRdd.partitioner
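Applied to the two DataFrames from the question (a hypothetical check; the exact result depends on your Spark version, though for JDBC sources it is typically None):

```scala
// Neither JDBC-backed DataFrame exposes a Partitioner, so the planner
// cannot prove the inputs are co-partitioned and must insert an Exchange
// on both sides before the SortMergeJoin.
println(df1.queryExecution.toRdd.partitioner)
println(df2.queryExecution.toRdd.partitioner)
```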

This might change in the future (SPIP: Data Source API V2, SPARK-15689 - Data source API v2, Spark Data Frame. PreSorted partitions).
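One practical way to sidestep the Spark-side shuffle entirely (a sketch under the assumption that both tables live in the same database, which the single DB_URL in the question suggests, and with hypothetical column names) is to let the database perform the join and have Spark read the already-joined result through a dbtable subquery:

```scala
// Hypothetical workaround: join tab1 and tab2 on the database side so
// Spark scans a single pre-joined relation and never needs to
// co-partition two inputs. Column names here are illustrative.
val joined = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("driver", Database.DB_DRIVER)
  .option("dbtable",
    """(select t1.*, t2.last_update_timestamp
      |   from tab1 t1
      |   join tab2 t2 on t1.id = t2.id) joined""".stripMargin)
  .option("partitionColumn", "id")
  .option("lowerBound", data.lowerBound)
  .option("upperBound", data.upperBound)
  .option("numPartitions", data.numPartitions)
  .load()
```

This trades the Exchange for extra work in the database, so whether it helps depends on how well the database handles the join and how much data crosses the wire either way.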



Source: https://stackoverflow.com/questions/47597970/how-to-join-two-jdbc-tables-and-avoid-exchange
