Reshape from wide to long in big query (standard SQL)

假装没事ソ 提交于 2020-06-07 21:39:31

问题


Unfortunately reshaping in BQ it's not as easy as in R and I can't export my data for this project.

Here is input

date    country A             B         C      D
20170928    CH  3000.3        121       13     3200
20170929    CH  2800.31       137       23     1614.31

Expected output

date    country Metric  Value  
20170928    CH  A       3000.3  
20170928    CH  B       121     
20170928    CH  C       13     
20170928    CH  D       3200
20170929    CH  A       2800.31 
20170929    CH  B       137       
20170929    CH  C       23     
20170929    CH  D       1614.31

Also my table has many more columns and rows (but I assume a lot of manual will be required)


回答1:


Below is for BigQuery Standard SQL and does not require repeating selects depends on number of columns. It will pick as many as you have and transform them into metrics and values

#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test / play with above using dummy data as in your question

#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

result is as expected

DATE        country metric  value    
20170928    CH      A       3000.3   
20170928    CH      B       121.0    
20170928    CH      C       13.0     
20170928    CH      D       3200.0   
20170929    CH      A       2800.31  
20170929    CH      B       137.0    
20170929    CH      C       23.0     
20170929    CH      D       1614.31  



回答2:


You need UNION which is denoted using commas in bigquery

SELECT date, country, Metric, Value
FROM (
  SELECT date, country, 'A' as Metric,  A as Value FROM your_table
), (
  SELECT date, country, 'B' as Metric,  B as Value FROM your_table
), (
  SELECT date, country, 'C' as Metric,  C as Value FROM your_table
) , (
  SELECT date, country, 'D' as Metric,  D as Value FROM your_table
)



回答3:


Most answers that I managed to find required specifying the name of EVERY column to be melt. This is not tractable when I have hundreds to thousands of columns in the table. Here is an answer that works for an arbitrarily wide table.

It utilizes dynamic SQL and automatically extract multiple column names from the data schema, collate a command string and then evaluate that string. This is intended to mimic Python pandas.melt() / R reshape2::melt() behavior.

I intentionally did not create user defined functions because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do that.

Input:

id0 id1 _2020_05_27 _2020_05_28
1   1   11          12
1   2   13          14
2   1   15          16
2   2   17          18

Output:

id0 id1 date         value
1   2   _2020_05_27  13
1   2   _2020_05_28  14
2   2   _2020_05_27  17
2   2   _2020_05_28  18
1   1   _2020_05_27  11
1   1   _2020_05_28  12
2   1   _2020_05_27  15
2   1   _2020_05_28  16
#standardSQL

-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: lunahuang@google.com

-- run this script with Google BigQuery Web UI in the Cloud Console

-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments

DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
  WITH original AS (
    -- query to retrieve the original table
    %s
  ),
  nested AS (
    SELECT
    [
      -- sub command to be automatically generated
      %s
    ] as s,
    -- equivalent to id_vars in pandas.melt()
    %s,
    FROM original
  )
  SELECT
    -- equivalent to id_vars in pandas.melt()
    %s,
    -- equivalent to var_name in pandas.melt()
    s.key AS %s,
    -- equivalent to value_name in pandas.melt()
    s.value AS %s,
  FROM nested
  CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
  WITH
  columns AS (
    -- query to retrieve the column names
    -- equivalent to value_vars in pandas.melt()
    -- the resulting table should have only one column
    -- with the name: column_name
    %s
  ),
  scs AS (
    SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
    FROM columns
  )
  SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");

-- -- -- EXAMPLE BELOW -- -- --

-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
  id0 INT64,
  id1 INT64,
  _2020_05_27 INT64,
  _2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);

-- MELTING STARTS --
-- execute these two command to melt the table

-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  subcmd,
  -- query to retrieve the column names
  -- equivalent to value_vars in pandas.melt()
  -- the resulting table should have only one column
  -- with the name: column_name
  """
    SELECT column_name
    FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
    WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
  """
) INTO subcmd;

-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  cmd,
  -- query to retrieve the original table
  """
    SELECT *
    FROM `tmp.example`
  """,
  -- please do not change this argument
  subcmd,
  -- equivalent to id_vars in pandas.melt()
  -- !!please type these twice!!
  "id0, id1", "id0, id1",
  -- equivalent to var_name in pandas.melt()
  "date",
  -- equivalent to value_name in pandas.melt()
  "value"
);


来源:https://stackoverflow.com/questions/47650782/reshape-from-wide-to-long-in-big-query-standard-sql

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!