What does it mean to have multiple sortkey columns?

温柔的废话 2020-12-24 02:58

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

3 answers
  •  旧巷少年郎
    2020-12-24 03:58

    We also use Redshift, with about 2 billion records (plus 20 million more every day), and in our experience the less selective a sort_key column is, the earlier it should appear in the sort_key list.

    In our case (and please analyze how you use and query your own data first), we originally used the timestamp as the first sort_key. The problem is that we record about 200 rows even within a single second, so each 1MB block covers only a few seconds and contains every type of data. So although the timestamp is highly selective, we cannot really filter any further after it, because every block holds all kinds of data.

    Recently we reversed the order of the sort_keys. The first one has about 15 distinct values, the second about 30, and so on, and the timestamp is now last. Even so, a single block still spans only seconds.

    Since we filter on the first two sort_keys very frequently, the results are as follows. Old solution: with a year of data, selecting one month drops about 91% of the blocks, but every remaining block then has to be opened, even though we want to filter further.

    New solution: the first sort_key drops about 14/15 of the blocks regardless of the date range, the second drops about 95% of the remaining ones, and the timestamp still drops about 91% of what is left.

    We tested this thoroughly with two 800-million-row tables that were identical except for the order of the sort keys. The longer the time period in the WHERE clause, the better the results we got. The difference was even more significant for joins.

    So my suggestion is: know your database and the kinds of queries you run frequently, because the most selective column might not be the best first sort_key. As Enno Shioji said, it all depends on what you filter by.
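
The effect described in this answer can be reproduced with a small simulation. What follows is only a rough sketch, not how Redshift works internally: it assumes a "block" is a fixed number of whole rows with a min/max range recorded per column, and it uses made-up column names (`category`, `subtype`, `ts`) with 15 and 30 distinct values and uniformly random data over 360 days. The helper `blocks_scanned` is hypothetical. It sorts the rows by a given key order, cuts them into blocks, and counts the blocks whose min/max ranges cannot rule out the filter.

```python
import random


def blocks_scanned(rows, key_order, predicate, block_rows=1000):
    """Return (blocks that must be read, total blocks) for one sort order.

    Simplified model (an assumption, not Redshift's actual storage layout):
    rows are sorted by key_order, cut into blocks of block_rows rows, and a
    block is skipped when any filtered column's [min, max] range in that
    block does not overlap the predicate's [lo, hi] range.
    """
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_order))
    scanned = total = 0
    for start in range(0, len(ordered), block_rows):
        block = ordered[start:start + block_rows]
        total += 1
        overlaps_all = all(
            min(r[col] for r in block) <= hi and max(r[col] for r in block) >= lo
            for col, (lo, hi) in predicate.items()
        )
        if overlaps_all:
            scanned += 1
    return scanned, total


random.seed(0)
DAY = 86400
# Synthetic data: hypothetical columns with 15 and 30 distinct values,
# plus a timestamp (in seconds) spread over 360 days.
rows = [
    {
        "category": random.randrange(15),
        "subtype": random.randrange(30),
        "ts": random.randrange(360 * DAY),
    }
    for _ in range(200_000)
]

# Filter: one category, one subtype, and the first 30 days.
predicate = {"category": (3, 3), "subtype": (7, 7), "ts": (0, 30 * DAY - 1)}

ts_first = blocks_scanned(rows, ["ts", "category", "subtype"], predicate)
ts_last = blocks_scanned(rows, ["category", "subtype", "ts"], predicate)

print("timestamp first: read %d of %d blocks" % ts_first)
print("timestamp last:  read %d of %d blocks" % ts_last)
```

With the timestamp first, every block inside the date range contains every category and subtype, so none of them can be skipped. With the low-cardinality columns first, only the few blocks around the requested category and subtype remain. Real results depend on the actual data distribution and query mix, which is the point of the answer above.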
