BigQuery argmax: Is array order maintained when doing CROSS JOIN UNNEST

Submitted by 匆匆过客 on 2020-01-05 08:36:18

Question:

In BigQuery, standard SQL, if I run

SELECT *
FROM mytable
CROSS JOIN UNNEST(mytable.array)

Can I be certain that the resulting row order is the same as the array order?

Example:

Let's say I have the following table mytable:

Row | id   | prediction
1   | abcd | [0.2, 0.5, 0.3]

If I run SELECT * FROM mytable CROSS JOIN UNNEST(mytable.prediction), can I be certain that the row order is the same as the array order? I.e. will the resulting table always be:

Row | id   | unnested_prediction
1   | abcd | 0.2
2   | abcd | 0.5
3   | abcd | 0.3

More background on use case (argmax):

I'm trying to find the array index with the largest value for the array in each row (argmax), i.e. the second element (0.5) in the array above. My target output is thus something like this:

Row | id   | argmax
1   | abcd | 2

Using CROSS JOIN, a DENSE_RANK window function ordered by the prediction value and a ROW_NUMBER window function to find the argmax, I am able to make this work with some test data. You can verify with this query:

WITH predictions AS (
  SELECT 'abcd' AS id, [0.2, 0.5, 0.3] AS prediction
  UNION ALL
  SELECT 'efgh' AS id, [0.7, 0.2, 0.1] AS prediction
),
ranked_predictions AS (
  SELECT 
    id,
    ROW_NUMBER() OVER (PARTITION BY id) AS rownum, -- This is the ordering I'm curious about
    DENSE_RANK() OVER (PARTITION BY id ORDER BY flattened_prediction DESC) AS array_rank
  FROM
     predictions P
  CROSS JOIN
    UNNEST(P.prediction) AS flattened_prediction
)
SELECT
  id,
  rownum AS argmax
FROM
  ranked_predictions
WHERE array_rank = 1

It could just be a coincidence that ROW_NUMBER behaves well in my tests (i.e. that it is ordered according to the unnested array), so it would be nice to be certain.


Answer 1:


Short answer: no, order is not guaranteed to be maintained.

Long answer: in practice, you'll most likely see that order is maintained, but you should not depend on it. The example that you provided is similar to this type of query:

SELECT *
FROM (
  SELECT 3 AS x UNION ALL
  SELECT 2 UNION ALL
  SELECT 1
  ORDER BY x
)

What is the expected order of the output? The ORDER BY is in the subquery, and the outer query doesn't impose any ordering, so BigQuery (or whatever engine you run this in) is free to reorder the rows in the output as it sees fit. You may end up getting back 1, 2, 3, or you may receive 3, 2, 1 or any other ordering. The more general principle is that projections are not order-preserving.
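
For comparison, here is a minimal sketch of the same query with the ORDER BY moved to the outermost SELECT, which is the only place where it determines the order of the final result:

```sql
-- The ORDER BY now belongs to the outermost query,
-- so the result rows come back as 1, 2, 3.
SELECT x
FROM (
  SELECT 3 AS x UNION ALL
  SELECT 2 UNION ALL
  SELECT 1
)
ORDER BY x
```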

While arrays have a well-defined order of their elements, when you use the UNNEST function, you're converting the array into a relation, which doesn't have a well-defined order unless you use ORDER BY. For example, consider this query:

SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)

The new_arr array isn't actually guaranteed to have the elements [2, 3, 4] in that order, since the query inside the ARRAY function doesn't use ORDER BY. You can address this non-determinism by ordering based on the element offsets, however:

SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x WITH OFFSET ORDER BY OFFSET) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)

Now the output is guaranteed to be [2, 3, 4].

Going back to your original question, you can ensure that you get deterministic output by imposing an ordering in the subquery that computes the row numbers:

ranked_predictions AS (
  SELECT 
    id,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY OFFSET) AS rownum,
    DENSE_RANK() OVER (PARTITION BY id ORDER BY flattened_prediction DESC) AS array_rank
  FROM
     predictions P
  CROSS JOIN
    UNNEST(P.prediction) AS flattened_prediction WITH OFFSET
)

I added the WITH OFFSET after the UNNEST, and ORDER BY OFFSET inside the ROW_NUMBER window in order to ensure that the row numbers are computed based on the original ordering of the array elements.
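
If you only need the argmax itself, a shorter variant may be enough. The sketch below is not from the original answer: it assumes the same predictions test data, and the aliases p and pos are just illustrative names. It sorts each array's elements by value inside a correlated subquery and keeps the 1-based position of the largest one.

```sql
WITH predictions AS (
  SELECT 'abcd' AS id, [0.2, 0.5, 0.3] AS prediction
  UNION ALL
  SELECT 'efgh' AS id, [0.7, 0.2, 0.1] AS prediction
)
SELECT
  id,
  (
    -- pos is the 0-based array offset, so add 1 for a 1-based argmax.
    SELECT pos + 1
    FROM UNNEST(prediction) AS p WITH OFFSET AS pos
    -- Largest value first; ties are broken by the earliest position.
    ORDER BY p DESC, pos
    LIMIT 1
  ) AS argmax
FROM predictions
ORDER BY id
```

With this test data the result is 2 for abcd and 1 for efgh. Unlike the DENSE_RANK version, which returns one row per tied maximum, this returns only the first position when several elements share the largest value.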




Answer 2:


"Can I be certain that the resulting row order is the same as the array order?"

You should use WITH OFFSET to get the position of each element in the array, so that you can then use it for ordering in your further logic:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'abcd' id, [0.2, 0.5, 0.3] prediction
)
SELECT id, unnested_prediction
FROM `project.dataset.table`, 
UNNEST(prediction) unnested_prediction WITH OFFSET pos
ORDER BY id, pos  
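
As one possible example of such further logic, the sketch below computes the argmax from the question by aggregating the positions per id. It assumes that id values are unique, and it adds 1 to pos only because the question expects a 1-based index.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'abcd' id, [0.2, 0.5, 0.3] prediction
)
SELECT
  id,
  -- Keep only the position of the largest value; ties go to the earliest position.
  ARRAY_AGG(pos + 1 ORDER BY unnested_prediction DESC, pos LIMIT 1)[OFFSET(0)] AS argmax
FROM `project.dataset.table`,
UNNEST(prediction) unnested_prediction WITH OFFSET pos
GROUP BY id
```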



Answer 3:


It seems to keep the ordering of the array intact by default.

However, one possible way to be 100% sure is to impose some sort of insignificant sorting, which tells the query processor inside the BigQuery black box not to apply any default ordering of its own.

Something like:

WITH predictions AS (
  SELECT 'abcd' AS id, [2.1, 0.1, 0.1, 0.2] AS prediction
)
select id, p from predictions
cross join unnest(prediction) p
order by 1=1
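
Note that every row ties under a constant sort key such as 1=1, so this ORDER BY probably gives the engine nothing to order by and the output order stays unspecified. Here is a sketch of the same query sorted on the element offset instead, as the other answers suggest (the alias pos is just an illustrative name):

```sql
WITH predictions AS (
  SELECT 'abcd' AS id, [2.1, 0.1, 0.1, 0.2] AS prediction
)
SELECT id, p
FROM predictions
CROSS JOIN UNNEST(prediction) AS p WITH OFFSET AS pos
-- Sorting on the 0-based offset reproduces the array order: 2.1, 0.1, 0.1, 0.2.
ORDER BY id, pos
```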


Source: https://stackoverflow.com/questions/53635020/bigquery-argmax-is-array-order-maintained-when-doing-cross-join-unnest
