BigQuery: JOIN ON with repeated / array STRUCT field in Standard SQL?

♀尐吖头ヾ submitted on 2021-01-28 13:39:53

Question


I have two tables, Orders and Items. Because they are imported from Google Cloud Datastore backup files, references are not stored as a simple ID field. A one-to-one relationship is stored as a STRUCT whose nested id field holds the unique ID I want to match. A one-to-many relationship (REPEATED) is stored as an ARRAY of such STRUCTs.

I can query the one-to-one relationships with a LEFT OUTER JOIN, and I also know how to join on a non-repeated struct and on a repeated string or int, but I cannot get a similar join to work with a repeated struct.

One Order with one item:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, STRUCT(STRUCT(2 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 2 AS __oid__, STRUCT(STRUCT(4 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 3 AS __oid__, STRUCT(STRUCT(6 AS id, "default" AS ns) AS key) AS item
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_item AS item
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_item
ON Order_item.key.id = item.key.id

Result (works as expected):

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

Similar query, but this time one order with many items:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_items AS items
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)

Error: IN subquery is not supported inside join predicate.

I actually expected this result:

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            1 |     default |       #1.1 |
|     |         |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            3 |     default |       #1.3 |
|     |         |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            5 |     default |       #1.5 |
|     |         |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

How do I change the second query to get the expected result?


Answer 1:


An alternative is to use a CROSS JOIN instead of a LEFT JOIN and move the IN condition into the WHERE clause:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders  

CROSS JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
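
Two things are worth keeping in mind with the query above. The WHERE filter makes it behave like an inner join, so an order with no matching item disappears from the result, and ARRAY_AGG without an ORDER BY does not guarantee the order of the array elements. The following is only a sketch on reduced sample data (the shortened Orders and Items rows and the final ORDER BY are my own additions for illustration) showing how the aggregation could be made deterministic:

```sql
#standardSQL
WITH Orders AS (
  -- Order 1 lists its items out of order on purpose (2 before 1).
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(2 AS id, "default" AS ns) AS key), STRUCT(STRUCT(1 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title
)

SELECT
   __oid__
   -- ORDER BY inside ARRAY_AGG fixes the element order within each array.
  ,ARRAY_AGG(Order_items ORDER BY Order_items.key.id) AS items
FROM Orders
CROSS JOIN Items AS Order_items
-- Same IN filter as above; it is allowed in WHERE, just not in a join predicate.
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
ORDER BY __oid__
```

Note that the CROSS JOIN still compares every order with every item before filtering, so this form may be expensive on large tables.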



Answer 2:


The problem is that BigQuery can't hash-partition the join keys from the two sides (since the join is expressed as an IN condition). You can make this work by flattening the array on the left-hand side and then aggregating the items from the right:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders,
UNNEST(items) AS item

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id = item.key.id
GROUP BY __oid__

This is probably what you wanted in any case: even if the join had been accepted, your original query would have returned items as a single struct per row rather than as an array of structs.
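
One caveat with the LEFT OUTER JOIN version: if an order references an item that does not exist in Items, the unmatched row contributes a NULL to ARRAY_AGG, and BigQuery raises an error when an array in the final result contains a NULL element. Below is a hedged sketch of one way to handle that, using IGNORE NULLS. The sample data (including the missing item id 99) is made up for illustration, and the sketch assumes that the unmatched Order_items row surfaces as a NULL struct:

```sql
#standardSQL
WITH Orders AS (
  -- Item id 99 is hypothetical and intentionally missing from Items.
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(99 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title
)

SELECT
   __oid__
   -- IGNORE NULLS skips the unmatched rows instead of putting NULL into the array.
  ,ARRAY_AGG(Order_items IGNORE NULLS) AS items
FROM Orders,
UNNEST(items) AS item

LEFT OUTER JOIN Items AS Order_items
ON Order_items.key.id = item.key.id
GROUP BY __oid__
```

If unmatched references should simply be dropped, replacing the LEFT OUTER JOIN with an inner JOIN is a simpler option. Also note that the comma join with UNNEST removes orders whose items array is empty; writing LEFT JOIN UNNEST(items) AS item instead keeps them.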



Source: https://stackoverflow.com/questions/51136595/bigquery-join-on-with-repeated-array-struct-field-in-standard-sql
