What is the difference between “predicate pushdown” and “projection pushdown”?

问题

I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as :

… if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

However, I have also seen the term "projection pushdown" in other documentation such as here, which appears to be the same thing but I am not sure in my understanding.

Is there a specific difference between the two terms?

回答1:

Predicate refers to the where/filter clause which effects the amount of rows returned.

Projection refers to the selected columns.

For example:

If your filters pass only 5% of the rows, only 5% of the table will be passed from the storage to Spark instead of the full table.

If your projection selects only 3 columns out of 10, then less columns will be passed from the storage to Spark and if your storage is columnar (e.g. Parquet, not Avro) and the non selected columns are not a part of the filter, then these columns won't even have to be read.

回答2:

In set and bag relational algebra alike, predicate pushdown eliminates tuples.

In bag relational algebra, projection pushdown eliminates attributes ("columns"), but in case of column based storage, it doesn't matter much, becasue columns that are not used higher up aren't being carried to begin with. Even a row based database may or may not benefit from projection pushdown (even SQL doesn't specify a physical access plan). Projection in bag RA is a very nominal operation that can be physically done at just the metadata level (flag some columns as inaccessible).

In set relational algebra, projection pushdown generally eliminates tuples as well, so this is where it has significance. Set RA projection is not an inexpensive operation, due to the need for deduplication. It's like a GROUP BY with no aggregated fields. Still, it's often worth doing the projection before a join, due to a possible vast decrease of tuple count.

Bag algebra tools eg. SQL also have ways for set RA projection, such as SELECT DISTINCT.

Neither predicate pushdown nor projection pushdown lead to more optimal execution in all cases. Depends on their selectivity and a lot of other things. Still, esp. predicate pushdown is a good heuristic as joins tend to be most expensive.

If the pushed-down projection has a sort index for the retained fields, or it needs to scan the table, there may be join algorithms with which the projection can be fused, avoiding a double reading of table/index structures.

来源：https://stackoverflow.com/questions/58235076/what-is-the-difference-between-predicate-pushdown-and-projection-pushdown

标签

apache-spark

bigdata

parquet