BigQuery deduplication query example explanation

Submitted by 元气小坏坏 on 2020-07-30 07:50:21

Question


Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it is used to take the first element of the aggregated array, right? Isn't that the same as LIMIT 1? Why do we need to aggregate the entire table? And why can we aggregate an entire table into a single cell?

    # take the one name associated with a SKU
    WITH product_query AS (
      SELECT DISTINCT
        v2ProductName,
        productSKU
      FROM `data-to-insights.ecommerce.all_sessions_raw`
      WHERE v2ProductName IS NOT NULL
    )
    SELECT k.* FROM (
      # aggregate the products into an array and
      # only take 1 result
      SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
      FROM product_query x
      GROUP BY productSKU # this is the field we want deduplicated
    );

Answer 1:


Let's start with some data we want to de-duplicate:

WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))

SELECT *
FROM table t

Now, instead of *, I'm going to use t to refer to the whole row:

SELECT t
FROM table t

What happens if I group these rows by their id? Each group becomes one output row, with the matching rows collected into an array:

SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1

Now I have all the rows with the same id grouped together. But let me choose only one:

SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1

That might look good, but that's still one row inside a one-element array. How can I get only the row, and not an array? [OFFSET(0)] extracts the first (zero-based) element of the array:

SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1

And if I want to get back the row itself, without the separate grouping id column or the tt prefix:

SELECT tt.*
FROM (
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)

And that's how you de-duplicate rows based on the rows' ids.

If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
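
To make the ordered variant concrete, here is a minimal sketch that reuses the sample rows from earlier in this answer. The `updated_at` column, its values, and the `sample` CTE name are assumptions added purely for illustration; they are not part of the original data.

```sql
# Hypothetical sample data: `updated_at` is an assumed column added for this sketch.
WITH sample AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b, TIMESTAMP '2020-01-01 00:00:00' AS updated_at),
    ('002', 3, 5, TIMESTAMP '2020-01-02 00:00:00'),
    ('001', 1, 4, TIMESTAMP '2020-01-03 00:00:00')
  ])
)
SELECT tt.*
FROM (
  # Within each id group, sort newest first, keep one row, and unwrap it from the array.
  SELECT ARRAY_AGG(t ORDER BY t.updated_at DESC LIMIT 1)[OFFSET(0)] AS tt
  FROM sample t
  GROUP BY t.id
)
```

With this data, the row kept for id '001' should be the one with b = 4, because it has the later `updated_at`. Without the ORDER BY, which of the two '001' rows survives is not guaranteed.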



Source: https://stackoverflow.com/questions/53719148/big-query-deduplication-query-example-explanation
