Question
I've got an ETL-like scenario, in which I read data from multiple JDBC tables and files and perform some aggregations and joins between sources.
In one step I must join two JDBC tables. I've tried to do something like:
val df1 = spark.read.format("jdbc")
.option("url", Database.DB_URL)
.option("user", Database.DB_USER)
.option("password", Database.DB_PASSWORD)
.option("dbtable", tableName)
.option("driver", Database.DB_DRIVER)
.option("upperBound", data.upperBound)
.option("lowerBound", data.lowerBound)
.option("numPartitions", data.numPartitions)
.option("partitionColumn", data.partitionColumn)
.load();
val df2 = spark.read.format("jdbc")
.option("url", Database.DB_URL)
.option("user", Database.DB_USER)
.option("password", Database.DB_PASSWORD)
.option("dbtable", tableName)
.option("driver", Database.DB_DRIVER)
.option("upperBound", data2.upperBound)
.option("lowerBound", data2.lowerBound)
.option("numPartitions", data2.numPartitions)
.option("partitionColumn", data2.partitionColumn)
.load();
df1.join(df2, Seq("partition_key", "id")).show();
Note that partitionColumn in both cases is the same - "partition_key".
However, when I run such a query, I can see an unnecessary exchange (plan trimmed for readability):
df1.join(df2, Seq("partition_key", "id")).explain(extended = true);
Project [many many fields]
+- Project [partition_key#10090L, iv_id#10091L, last_update_timestamp#10114, ... more fields]
+- SortMergeJoin [partition_key#10090L, id#10091L], [partition_key#10172L, id#10179L], Inner
:- *Sort [partition_key#10090L ASC NULLS FIRST, iv_id#10091L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(partition_key#10090L, iv_id#10091L, 4)
: +- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab2 s)) [numPartitions=23] [partition_key#10090L,id#10091L,last_update_timestamp#10114] PushedFilters: [*IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint,last_update_timestamp:timestamp>
+- *Sort [partition_key#10172L ASC NULLS FIRST, id#10179L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partition_key#10172L, iv_id#10179L, 4)
+- *Project [partition_key#10172L, id#10179L ... 75 more fields]
+- *Scan JDBCRelation((select mod(s.id, 23) as partition_key, s.* from tab1 s)) [numPartitions=23] [fields] PushedFilters: [*IsNotNull(ID), *IsNotNull(PARTITION_KEY)], ReadSchema: struct<partition_key:bigint,id:bigint...
If the reads are already partitioned with numPartitions and the other options, and the partition counts are the same, why is another Exchange needed? Can we somehow avoid this unnecessary shuffle? On test data I see Spark sending more than 150M of data during this Exchange, and the production Datasets are much bigger, so it can become a serious bottleneck.
Answer 1:
With the current implementation of the Data Source API there is no partitioning information passed upstream, so even if the data could be joined without a shuffle, Spark cannot make use of that fact. Therefore your assumption that:
JdbcRelation uses RangePartitioning on reading
is simply incorrect. Furthermore, Spark appears to use the same internal code to handle range-based JDBC partitioning and predicate-based JDBC partitioning. While the former could in principle be translated into a SortOrder, the latter might be incompatible with Spark SQL in general.
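For context, here is a minimal sketch of the two JDBC partitioning styles referred to above; the URL, table name and predicate expressions are placeholders, not taken from the original question:

// Range-based partitioning: Spark generates numPartitions WHERE clauses
// over the numeric partitionColumn between lowerBound and upperBound.
val rangePartitioned = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/db")   // placeholder URL
  .option("dbtable", "tab1")
  .option("partitionColumn", "partition_key")
  .option("lowerBound", "0")
  .option("upperBound", "22")
  .option("numPartitions", "23")
  .load()

// Predicate-based partitioning: one partition per explicit WHERE predicate.
import java.util.Properties
val predicates = (0 until 23).map(k => s"mod(id, 23) = $k").toArray
val predicatePartitioned = spark.read.jdbc(
  "jdbc:postgresql://host/db",                  // placeholder URL
  "tab1",
  predicates,
  new Properties())

In both cases the resulting DataFrame reports no partitioning to the optimizer, which is why the join still plans an Exchange.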
When in doubt, it is possible to retrieve the Partitioner information using QueryExecution and the internal RDD:
df.queryExecution.toRdd.partitioner
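For example (a sketch; df1 here refers to the JDBC-backed DataFrame from the question), the JDBC scan exposes no partitioner, which matches the explanation above:

df1.queryExecution.toRdd.partitioner   // Option[Partitioner], typically None for a plain JDBC scan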
This might change in the future (SPIP: Data Source API V2, SPARK-15689 - Data source API v2, and Spark Data Frame. PreSorted partitions).
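If the extra Exchange really is the bottleneck, one possible workaround (only a sketch, under the assumption that both tables live in the same database; the column list and aliases are placeholders) is to push the join into the database itself through a dbtable subquery, so Spark receives already-joined rows and has nothing left to co-partition:

// The join is executed by the database; Spark only reads the joined result,
// so no Exchange is needed on the Spark side.
val joinedQuery =
  "(select mod(t1.id, 23) as partition_key, t1.*, t2.last_update_timestamp " +
  "from tab1 t1 join tab2 t2 on t1.id = t2.id) joined"

val joined = spark.read.format("jdbc")
  .option("url", Database.DB_URL)
  .option("user", Database.DB_USER)
  .option("password", Database.DB_PASSWORD)
  .option("driver", Database.DB_DRIVER)
  .option("dbtable", joinedQuery)
  .option("partitionColumn", "partition_key")   // parallel read still possible
  .option("lowerBound", "0")
  .option("upperBound", "22")
  .option("numPartitions", "23")
  .load()

This trades the Spark-side shuffle for extra work on the database, so whether it helps depends on how well the database handles the join.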
Source: https://stackoverflow.com/questions/47597970/how-to-join-two-jdbc-tables-and-avoid-exchange