How can I write this Postgres query in Amazon Redshift so that it is as well optimized as it was in Postgres?

Submitted by 冷暖自知 on 2019-12-25 03:54:25

Question


Here is the original query that I was using in Postgres:

SELECT a.id,
    (SELECT val
       FROM database.detail x
      WHERE name = 'blablah'
        AND x.id = b.id) AS myGroup,
    c.username,
    a.someCode,
    a.timeTaken,
    a.date ::timestamp WITH time ZONE AT time ZONE 'PST' AS date,
    SUM (CASE WHEN (b.name = 'name1') THEN b.val ::INTEGER ELSE 0 END ) AS name11,
    SUM (CASE WHEN (b.name = 'name2') THEN b.val ::INTEGER ELSE 0 END ) AS name12
FROM
    database.myTable a,
    database.detail b,
    database.client c
WHERE
    a.id = b.id
    AND a.c_id = c.c_id
    AND a.date > current_date - interval '2 weeks'
GROUP BY 1, 2, 3, 4, 5, 6

This is how I converted it into an Amazon Redshift query:

SELECT a.id,
    b.val AS myGroup,
    c.username,
    a.someCode,
    a.timeTaken,
    convert_timezone('PST', a.date) AS date,
    SUM (CASE WHEN (b.name = 'name1') THEN b.val ::INTEGER ELSE 0 END ) AS name11,
    SUM (CASE WHEN (b.name = 'name2') THEN b.val ::INTEGER ELSE 0 END ) AS name12
FROM
    database.myTable a,
    database.detail b,
    database.client c
WHERE
    a.id = b.id
    AND b.name = 'blablah'
    AND a.c_id = c.c_id
    AND a.date > current_date - interval '2 weeks'
GROUP BY 1, 2, 3, 4, 5, 6 LIMIT 10

The CASE expressions do not behave as expected: the values for name11 and name12 are all zero. My Postgres query returns valid values for these columns, but the Redshift query does not.

Also, this query is very slow. The Postgres query takes about 150 ms, while the Redshift query takes 2 minutes.

How can we do this better?


Answer 1:


Redshift query performance depends on the cluster, table design, data loading, and regular vacuuming and analyzing of the tables.

Let me address some core points from that list:

  1. Make sure your tables myTable, detail, and client have proper sort keys (SORTKEY) and distribution keys (DISTKEY).
  2. Make sure all the tables in the join are analyzed and vacuumed properly.
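The two points above could be checked and applied roughly as follows. This is only a sketch: the schema name `database` is taken from the question, and the choice of `id` as distribution key and `date` as sort key is an assumption based on the join and filter columns in the query, not something the original poster confirmed.

```sql
-- Inspect the current distribution style, first sort key, percentage of
-- unsorted rows, and staleness of statistics for the tables in the join.
SELECT "table", diststyle, sortkey1, unsorted, stats_off
  FROM svv_table_info
 WHERE "schema" = 'database';

-- Assumption: distributing both tables on the join column keeps matching
-- rows on the same slice, and sorting myTable on date helps the range filter.
ALTER TABLE database.myTable ALTER DISTKEY id;
ALTER TABLE database.detail  ALTER DISTKEY id;
ALTER TABLE database.myTable ALTER SORTKEY (date);

-- Re-sort rows and reclaim space, then refresh the planner statistics.
-- VACUUM cannot run inside a transaction block.
VACUUM database.myTable;
VACUUM database.detail;
VACUUM database.client;
ANALYZE database.myTable;
ANALYZE database.detail;
ANALYZE database.client;
```

High values in the unsorted or stats_off columns suggest that a table is due for VACUUM or ANALYZE respectively.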

Here is another version of the same SQL, written for Redshift.

The tweaks I made are:

  1. Used WITH clauses to optimize cluster-level computation.
  2. Used explicit joins; check whether a left or right join is needed based on your data.
  3. Used a date_range WITH-clause table to keep the date computation in one place.
  4. Used GROUP BY in the main SQL below.

My version of the Redshift SQL:

/** Date range computation **/
with date_range as (
    select (current_date - interval '2 weeks') as two_weeks
),
/** Filter the main result set **/
myGroupSet as (
    SELECT a.id,
           g.val AS myGroup,  -- val of the 'blablah' detail row, as in the Postgres subquery
           c.username,
           a.someCode,
           a.timeTaken,
           convert_timezone('PST', a.date) AS date,
           (case when (b.name = 'name1') THEN b.val::INTEGER ELSE 0 END) as name11,
           (case when (b.name = 'name2') THEN b.val::INTEGER ELSE 0 END) as name12
      FROM database.myTable a
      join date_range dr on a.date > dr.two_weeks
      join database.detail b on b.id = a.id
      -- separate alias for the group row, so b still sees the name1/name2 rows
      left join database.detail g on g.id = a.id and g.name = 'blablah'
      join database.client c on c.c_id = a.c_id
)
/** Apply aggregation **/
select id, myGroup, username, someCode, timeTaken, date,
       sum(name11) as name11, sum(name12) as name12
  from myGroupSet
  group by id, myGroup, username, someCode, timeTaken, date
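On the zero values for name11 and name12: in the question's Redshift query, the filter b.name = 'blablah' is applied to the same alias b that the CASE expressions read from. Only 'blablah' rows survive the filter, so b.name can never equal 'name1' or 'name2' and both sums fall through to ELSE 0. The Postgres version avoids this because its subquery uses a separate alias (x). The small self-contained query below illustrates the effect; the three sample rows are made up for illustration and merely stand in for database.detail.

```sql
-- Hypothetical sample rows standing in for database.detail.
WITH detail AS (
    SELECT 1 AS id, 'blablah'::VARCHAR AS name, 'groupA'::VARCHAR AS val
    UNION ALL SELECT 1, 'name1', '5'
    UNION ALL SELECT 1, 'name2', '7'
)
SELECT g.val AS myGroup,
       SUM(CASE WHEN b.name = 'name1' THEN b.val::INTEGER ELSE 0 END) AS name11,
       SUM(CASE WHEN b.name = 'name2' THEN b.val::INTEGER ELSE 0 END) AS name12
  FROM detail b
  -- A second alias fetches the group value without filtering b itself.
  LEFT JOIN detail g ON g.id = b.id AND g.name = 'blablah'
 GROUP BY g.val;
-- Replacing the LEFT JOIN with "WHERE b.name = 'blablah'" and selecting
-- b.val AS myGroup (as in the question) leaves only the 'blablah' row,
-- so both sums come out as 0.
```

The LEFT JOIN mirrors the scalar subquery in the Postgres query, which yields NULL for myGroup when no 'blablah' row exists for an id.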


Source: https://stackoverflow.com/questions/38231015/how-can-i-write-this-postgres-query-in-amazon-redshift-such-that-it-is-as-optimi
