What is the difference between “predicate pushdown” and “projection pushdown”?

狂风中的少年 提交于 2021-01-21 05:22:46

问题


I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as :

… if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

However, I have also seen the term "projection pushdown" in other documentation such as here, which appears to be the same thing but I am not sure in my understanding.

Is there a specific difference between the two terms?


回答1:


Predicate refers to the where/filter clause which effects the amount of rows returned.

Projection refers to the selected columns.

For example:

If your filters pass only 5% of the rows, only 5% of the table will be passed from the storage to Spark instead of the full table.

If your projection selects only 3 columns out of 10, then less columns will be passed from the storage to Spark and if your storage is columnar (e.g. Parquet, not Avro) and the non selected columns are not a part of the filter, then these columns won't even have to be read.




回答2:


In set and bag relational algebra alike, predicate pushdown eliminates tuples.

In bag relational algebra, projection pushdown eliminates attributes ("columns"), but in case of column based storage, it doesn't matter much, becasue columns that are not used higher up aren't being carried to begin with. Even a row based database may or may not benefit from projection pushdown (even SQL doesn't specify a physical access plan). Projection in bag RA is a very nominal operation that can be physically done at just the metadata level (flag some columns as inaccessible).

In set relational algebra, projection pushdown generally eliminates tuples as well, so this is where it has significance. Set RA projection is not an inexpensive operation, due to the need for deduplication. It's like a GROUP BY with no aggregated fields. Still, it's often worth doing the projection before a join, due to a possible vast decrease of tuple count.

Bag algebra tools eg. SQL also have ways for set RA projection, such as SELECT DISTINCT.

Neither predicate pushdown nor projection pushdown lead to more optimal execution in all cases. Depends on their selectivity and a lot of other things. Still, esp. predicate pushdown is a good heuristic as joins tend to be most expensive.

If the pushed-down projection has a sort index for the retained fields, or it needs to scan the table, there may be join algorithms with which the projection can be fused, avoiding a double reading of table/index structures.



来源:https://stackoverflow.com/questions/58235076/what-is-the-difference-between-predicate-pushdown-and-projection-pushdown

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!