Question
Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it takes the first element of the aggregated array, right? Isn't that the same as LIMIT 1? Why do we need to aggregate the entire table? And why can we aggregate an entire table into a single cell?
# take the one name associated with a SKU
WITH product_query AS (
SELECT
DISTINCT
v2ProductName,
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE v2ProductName IS NOT NULL
)
SELECT k.* FROM (
# aggregate the products into an array and
# only take 1 result
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM product_query x
GROUP BY productSKU # this is the field we want deduplicated
);
Answer 1:
Let's start with some data we want to de-duplicate:
WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))
SELECT *
FROM table t
Now, instead of *, I'm going to use t to refer to the whole row:
SELECT t
FROM table t
What happens if I group each of these rows by their id:
SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1
Now I have all the rows with the same id grouped together. But let me choose only one:
SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1
That might look good, but that's still one row inside one array. How can I get only the row, and not an array:
SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
And if I want to get back the row without the grouping id or the tt prefix:
SELECT tt.*
FROM (
SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
)
And that's how you de-duplicate rows based on their ids.
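On the question of whether this is the same as LIMIT 1: a LIMIT at the end of a query caps the whole result, while the LIMIT inside ARRAY_AGG caps each group's array separately. A minimal sketch on the same sample data (the output column names are made up for illustration):

```sql
WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))
SELECT
  # query-level LIMIT: one row for the entire table
  (SELECT COUNT(*) FROM (SELECT * FROM table t LIMIT 1)) AS rows_with_query_limit,
  # LIMIT inside ARRAY_AGG: one row per distinct id
  (SELECT COUNT(*) FROM (
    SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
    FROM table t
    GROUP BY t.id
  )) AS rows_with_per_group_limit
```

With the three sample rows this should report 1 for the query-level LIMIT and 2 for the per-group one, since there are two distinct ids. Note also that the aggregation is not over the entire table: each array only collects the rows that share one id.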
If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
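Putting that together, here is a sketch that keeps the newest row per id. It assumes a hypothetical TIMESTAMP column named ts; the sample rows and column names are invented for illustration:

```sql
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, TIMESTAMP '2018-12-01 00:00:00+00' AS ts),
    ('002', 3, TIMESTAMP '2018-12-02 00:00:00+00'),
    ('001', 4, TIMESTAMP '2018-12-03 00:00:00+00')
  ])
)
SELECT tt.*
FROM (
  # newest row first within each id, then keep only that one
  SELECT ARRAY_AGG(t ORDER BY t.ts DESC LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY t.id
)
```

For id '001' this should keep the row with a = 4, since it has the later timestamp. Without the ORDER BY, which of the duplicate rows survives is not guaranteed.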
Source: https://stackoverflow.com/questions/53719148/big-query-deduplication-query-example-explanation