Optimizing join in Vertica

Question

I have a query like this:

SELECT a.column, b.column
FROM   table_a a
       INNER JOIN table_b b
         ON a.id = b.id
WHERE  a.anotherid = 'some condition'

It is supposed to be very fast, because with the predicate a.anotherid = 'some condition' the query plan should filter out much of the data before it is joined with table_b. However, according to the Vertica documentation:

The WHERE clause is evaluated after the join is performed. It filters records returned by the FROM clause, eliminating any records that do not satisfy the WHERE clause condition.

This means the query performs the join first and filters afterwards, which is very slow. The query plan shows this as well.

So, is there any way to push the filter before the join? Or is there any other way to optimize the query?

Answer 1:

The EXPLAIN shows NO STATISTICS. These need to be updated.
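As a minimal sketch of updating them, assuming the tables are named table_a and table_b as in the question (substitute your own schema-qualified names), Vertica's ANALYZE_STATISTICS function can be run on each table:

```sql
-- Collect statistics so the optimizer can estimate row counts and costs.
-- table_a and table_b are the placeholder names from the question.
SELECT ANALYZE_STATISTICS('table_a');
SELECT ANALYZE_STATISTICS('table_b');
```

After this, running EXPLAIN on the query again should no longer report NO STATISTICS for those tables.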
Vertica will optimize the predicate in this case using SIP:

Sideways Information Passing (SIP) has been effective in improving join performance by filtering data as early as possible in the plan. It can be thought of as an advanced variation of predicate push down since the join is being used to do filtering [27]. For example, consider a HashJoin that joins two tables using simple equality predicates. The HashJoin will first create a hash table from the inner input before it starts reading data from the outer input to do the join. Special SIP filters are built during optimizer planning and placed in the Scan operator. At run time, the Scan has access to the Join’s hash table and the SIP filters are used to evaluate whether the outer key values exist in the hash table. Rows that do not pass these filters are not output by the Scan thus increasing performance since we are not unnecessarily bringing the data through the plan only to be filtered away later by the join.

For example:

SELECT a.online_page_key
FROM   online_sales.online_sales_fact a
       JOIN online_sales.online_page_dimension b
         ON b.online_page_key = a.online_page_key
WHERE  b.page_type = 'quarterly';

Will produce the same plan as:

SELECT a.online_page_key 
FROM   online_sales.online_sales_fact a 
       JOIN (SELECT * 
             FROM   online_sales.online_page_dimension 
             WHERE  page_type = 'quarterly') b 
         ON b.online_page_key = a.online_page_key;

Which looks like:

 Access Path:
 +-JOIN HASH [Cost: 14K, Rows: 988K] (PATH ID: 1)
 |  Join Cond: (online_page_dimension.online_page_key = a.online_page_key)
 | +-- Outer -> STORAGE ACCESS for a [Cost: 12K, Rows: 5M] (PATH ID: 2)
 | |      Projection: online_sales.online_sales_fact_super
 | |      Materialize: a.online_page_key
 | |      Runtime Filter: (SIP1(HashJoin): a.online_page_key)
 | +-- Inner -> STORAGE ACCESS for online_page_dimension [Cost: 36, Rows: 198] (PATH ID: 3)
 | |      Projection: online_sales.online_page_dimension_super
 | |      Materialize: online_page_dimension.online_page_key
 | |      Filter: (online_page_dimension.page_type = 'quarterly')
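To check whether your own query gets the same treatment, you can prefix it with EXPLAIN and look for a Filter line on the inner scan and a Runtime Filter (SIP) line on the outer scan. A sketch using the sample query from above (your costs and row counts will differ):

```sql
-- EXPLAIN prints the plan without executing the query.
EXPLAIN
SELECT a.online_page_key
FROM   online_sales.online_sales_fact a
       JOIN online_sales.online_page_dimension b
         ON b.online_page_key = a.online_page_key
WHERE  b.page_type = 'quarterly';
```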

Most of the time, a hash join is sufficient. If you want to optimize for a merge join instead, see my post on optimizing for merge join.

Source: https://stackoverflow.com/questions/32224728/optimizing-join-in-vertica

Tags

sql

database

vertica