athena presto - multiple columns from long to wide

柔情痞子 提交于 2021-02-04 08:37:25

问题


I am new to Athena and I am trying to understand how to turn multiple columns from long to wide format. It seems like presto is what is needed, but I've only successfully been able to apply map_agg to one variable. I think my below final outcome can be achieved with multimap_agg but cannot quite get it to work.

Below I walk through my steps and data. If you have some suggestions or questions, please let me know!

First, the data starts like this:

id  | letter    | number   | value
------------------------------------
123 | a         | 1        | 62
123 | a         | 2        | 38
123 | a         | 3        | 44
123 | b         | 1        | 74
123 | b         | 2        | 91
123 | b         | 3        | 97
123 | c         | 1        | 38
123 | c         | 2        | 98
123 | c         | 3        | 22
456 | a         | 1        | 99
456 | a         | 2        | 33
456 | a         | 3        | 81
456 | b         | 1        | 34
456 | b         | 2        | 79
456 | b         | 3        | 43
456 | c         | 1        | 86
456 | c         | 2        | 60
456 | c         | 3        | 59

Then I transform the data into the below using filtering with the where clause and then joining:

id  | letter  | 1  | 2  | 3
----------------------------
123 | a       | 62 | 38 | 44
123 | b       | 74 | 91 | 97
123 | c       | 38 | 98 | 22
456 | a       | 99 | 33 | 81
456 | b       | 34 | 79 | 43
456 | c       | 86 | 60 | 59

For the final outcome, I would like to transform it into the below:

id  | a_1   | a_2   | a_3   | b_1   | b_2   | b_3   | c_1   | c_2   | c_3
--------------------------------------------------------------------------
123 | 62    | 38    | 44    | 74    | 91    | 97    | 38    | 98    | 22
456 | 99    | 33    | 81    | 34    | 79    | 43    | 86    | 60    | 59

回答1:


You can use window functions and conditional aggregation. This requires that you know in advance the possible letters, and the maximum rows per id/letter tuple:

select
    id,
    max(case when letter = 'a' and rn = 1 then value end) a_1,
    max(case when letter = 'a' and rn = 2 then value end) a_2,
    max(case when letter = 'a' and rn = 3 then value end) a_3,
    max(case when letter = 'b' and rn = 1 then value end) b_1,
    max(case when letter = 'b' and rn = 2 then value end) b_2,
    max(case when letter = 'b' and rn = 3 then value end) b_3,
    max(case when letter = 'c' and rn = 1 then value end) c_1,
    max(case when letter = 'c' and rn = 2 then value end) c_2,
    max(case when letter = 'c' and rn = 3 then value end) c_3
from (
    select 
        t.*, 
        row_number() over(partition by id, letter order by number) rn
    from mytable t
) t
group by id

Actually, if the numbers are always 1, 2, 3, then you don't even need the window function:

select
    id,
    max(case when letter = 'a' and number = 1 then value end) a_1,
    max(case when letter = 'a' and number = 2 then value end) a_2,
    max(case when letter = 'a' and number = 3 then value end) a_3,
    max(case when letter = 'b' and number = 1 then value end) b_1,
    max(case when letter = 'b' and number = 2 then value end) b_2,
    max(case when letter = 'b' and number = 3 then value end) b_3,
    max(case when letter = 'c' and number = 1 then value end) c_1,
    max(case when letter = 'c' and number = 2 then value end) c_2,
    max(case when letter = 'c' and number = 3 then value end) c_3
from mytable t
group by id



回答2:


Athena needs the columns to be known at query time, but the next best thing is using a map, as you hint to in your question.

One way to achieve the results you are after is this query (the_table refers to the first table in your questions, the one with id, letter, number, and value columns):

SELECT
  id,
  map_agg(letter || '_' || CAST(number AS varchar), value) AS letter_number_value
FROM the_table
GROUP BY id

Which gives this result:

id  | letter_number_value
----+-------------------------------------------------------------------------
123 | {a_1=62, a_2=38, a_3=44, b_1=74, b_2=91, b_3=97, c_1=38, c_2=98, c_3=22}
456 | {a_1=99, a_2=33, a_3=81, b_1=34, b_2=79, b_3=43, c_1=86, c_2=60, c_3=59}

I cheated slightly by manually sorting the map keys, if you run the query they will end up in arbitrary order, but I figured that this way it is easier to see that the result is the desired.

Please note that this assumes there are no duplicate letter/number combinations, if there are I think it's undefined which value will end up in the result.

Also note that Athena's output format for maps is ambiguous and that there are situations where you can end up with unparseable results (for example when keys or values include equal signs or commas). Therefore I would recommend casting the map as JSON and using a JSON parser in your application code, e.g. CAST(map_agg(…) AS JSON).



来源:https://stackoverflow.com/questions/63142257/athena-presto-multiple-columns-from-long-to-wide

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!