Why doesn't LAST_VALUE return the last value?

和自甴很熟 提交于 2020-06-09 08:54:00

问题


I want to find the last value of y over an ordered partition using a query like this:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table

But LAST_VALUE returns lots of values that aren't the last value (in this case, the largest value) of y for a given partition. Why?

(In this case, MAX can be used instead of LAST_VALUE to find the largest value, but why doesn't LAST_VALUE return the max value too?)


回答1:


TLDR: The query you want is:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

Possibly followed by GROUP BY to collapse duplicate output rows from the analytic function.

And of course, it's simpler to just use MAX over an unordered partition if that's all you need:

SELECT
  x,
  MAX(y) OVER (PARTITION BY x)
FROM table

Before answering this question, here's a little background on analytic functions (a.k.a. window functions). All of the below is standard SQL and not specific to BigQuery.

First, analytic functions are not aggregation functions. Whereas aggregation functions collapse multiple input rows into a single output row, analytic functions compute exactly one output row for every input row. So you need to make sure you're thinking about what the output is for every input row.

Second, analytic functions operate over a "window" of rows that is a subset of the "partition" to which the row belongs. The partition for an input row is determined by the PARTITION BY clause, or you can omit it if you want the partition to be the entire set of input rows. The window is given by the ROWS clause, but if you don't specify it (and users usually don't), it defaults to either the entire partition (when no ordering is applied) or the set of rows in the partition from the first row to the current row (when an ORDER BY is present). Note that the window can differ for each input row in a partition!

Now, back to LAST_VALUE. Although the default window described above is reasonable in many cases (e.g., computing cumulative sums), it works spectacularly poorly with LAST_VALUE. The LAST_VALUE function returns the value of the last row in the window, and by default the last row in the window is the current row.

So to fix the problem, you need to explicitly specify that the window for LAST_VALUE is the entire partition, not just the rows up to the current row. You can do so as follows:

SELECT x, LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

To test this out, here's an example:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,1,1
2,1,4,2,2
3,1,4,3,3
4,1,4,4,4

Note that LAST_VALUE returns 1, 2, 3, 4 instead of just 4 because the window changes for each input row.

Now let's specify a window that is the entire partition:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,4,1
2,1,4,4,1
3,1,4,4,1
4,1,4,4,1

Now we get 4 for LAST_VALUE as expected.




回答2:


Even that question title uses LAST_VALUE - the question itself asks for Largest Value!
I would just go with below:

SELECT x, MAX(y) OVER (PARTITION BY x) FROM table  

And if no other fields from table involved - i would just do simple GROUP BY:

SELECT x, MAX(y) FROM table GROUP BY x 

Of course, we shoudl remember that not always Largest value and MAX value are the same thing.




回答3:


Other option you have is to change the query order by to desc

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table
order by x desc

but the you will get the last value only for the first row



来源:https://stackoverflow.com/questions/35096414/why-doesnt-last-value-return-the-last-value

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!