Athena: Minimize data scanned by query including JOIN operation

Submitted by こ雲淡風輕ζ on 2019-12-11 00:01:34

Question


Suppose there is an external table in Athena that points to a large amount of data stored in Parquet format on S3. It has many columns and is partitioned on a field called 'timeid'. There is also a second, small external table that maps timeid to date.

When the smaller table is also partitioned on timeid, and the two tables are joined on that partition key (timeid) with a condition on date in the WHERE clause, only the partitions of the large table whose timeids correspond to that date are scanned. The full data set is not scanned.

However, if the smaller table is not partitioned on timeid, a full data scan takes place even with a condition on the date column.

Is there a way to avoid the full data scan even when the large partitioned table is joined with an unpartitioned small table? This matters because the small table contains only one record per timeid, and it does not seem reasonable to create a separate file for each one.


Answer 1:


That's an interesting discovery!

You might be able to avoid the large scan by using a sub-query instead of a join.

Instead of:

SELECT ...
FROM large_table
JOIN small_table ON large_table.timeid = small_table.timeid
WHERE small_table.date > '2017-08-03'

you might be able to use:

SELECT ...
FROM large_table
WHERE large_table.timeid IN
         (SELECT timeid FROM small_table
          WHERE date > '2017-08-03')

I haven't tested it, but that would avoid the JOIN you mention.
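
If the sub-query still triggers a full scan, one possible fallback is to split the work into two queries, so that the filter on the partition column is a list of literal values. This is only a rough sketch and is also untested: large_table and small_table stand in for the two tables in the question, and the timeid values 1001, 1002 and 1003 are made up for illustration.

```sql
-- Step 1: query only the small mapping table to find the matching partition ids.
SELECT timeid
FROM small_table
WHERE date > '2017-08-03';

-- Step 2: put the ids returned by step 1 into the query on the large table.
-- The values below are hypothetical placeholders for that result.
-- Listing only the columns you need (instead of *) reduces the data scanned
-- further, because Parquet is a columnar format.
SELECT *
FROM large_table
WHERE timeid IN (1001, 1002, 1003);
```

With constant values on the partition column, Athena should be able to prune partitions before reading any data. The trade-off is an extra round trip, plus some client-side code to build the second query from the result of the first.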



Source: https://stackoverflow.com/questions/45470467/athena-minimize-data-scanned-by-query-including-join-operation
