Using a CASE statement to change the value of a new BigQuery column based on finding one specific entry inside a PARTITION

落爺英雄遲暮 submitted on 2020-01-22 03:30:07

Question


I am trying to write some CASE statements that change the value of every entry in a column if a particular condition is satisfied anywhere inside the partition. Here is the specific context. Imagine that I have a data set that was created using the following SQL query:

SELECT date, CONCAT(fullVisitorId, STRING(visitId)) AS unique_visit_id,
  visitId, visitNumber, fullVisitorId, totals.pageviews, totals.bounces,
  LAG(hits.page.pagePath, 1) OVER (PARTITION BY unique_visit_id ORDER BY hits.time ASC) AS lagged,
  hits.page.pagePath, hits.page.pageTitle,
  device.deviceCategory, device.browser, device.browserVersion,
  hits.customVariables.index, hits.customVariables.customVarName, hits.customVariables.customVarValue,
  hits.time
FROM (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
WHERE hits.customVariables.index = 4
LIMIT 1000;

The resulting data set looks similar to the following (shown in Excel):

Note that the unique_visit_id is the same for every row of a given visit. What I would like to do in many instances is scan the hits_page_pagePath values within a visit. I would like to construct a CASE statement such that, when the lagged URL (matched using REGEXP_MATCH()) equals a particular value, and hits_page_pagePath equals a certain value when hits_time = 0, a new column labels the entire partition with a certain value. For example, say I find an error page in hits_page_pagePath and the lagged value is a certain page. In that case, I would label the entire partition "Booking error". If a different page preceded the error, I would give the partition a different label, such as "Payment error". The table would then look like the one below:

This would repeat for all the unique_visit_id partitions. I would then be able to group together counts of total bounces, hits, events, etc., for each partition. Any insight would be greatly appreciated!


Answer 1:


It is entirely possible that this could be done with a clever use of analytic functions, but my SQL-fu isn't up to it. That said, it sounds like what you want is achievable with a simple JOIN. Let's say your current query is called Q (you could even save it as a view to make this easier).

Run

SELECT t1.*, t2.has_some_property
FROM Q AS t1
LEFT OUTER JOIN (
  -- One row per visit in which at least one hit satisfies the condition
  SELECT unique_visit_id, 1 AS has_some_property
  FROM Q
  WHERE (REGEXP_MATCH(lagged, ...)
      AND REGEXP_MATCH(hits.page.pagePath, ...))
  GROUP BY unique_visit_id
  ) AS t2
ON t1.unique_visit_id = t2.unique_visit_id
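
To try the join pattern end to end without a BigQuery project, here is a minimal sketch that runs the same shape of query in SQLite through Python's sqlite3 module (this needs SQLite 3.25 or later for LAG). Everything in it is an assumption made for illustration: the hits table, the column names hit_time and page_path, the paths '/booking', '/payment' and '/error', and the use of plain equality where the original uses REGEXP_MATCH. The subquery also returns a label rather than the flag 1, which is one possible way to get the "Booking error" / "Payment error" column the question asks for.

```python
import sqlite3

# Toy stand-in for the question's data (hypothetical visits and paths).
rows = [
    # (unique_visit_id, hit_time, page_path)
    ("A", 0, "/booking"),
    ("A", 5, "/error"),
    ("A", 9, "/home"),
    ("B", 0, "/payment"),
    ("B", 4, "/error"),
    ("C", 0, "/home"),
    ("C", 3, "/booking"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (unique_visit_id TEXT, hit_time INTEGER, page_path TEXT)"
)
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)

# q plays the role of Q: each hit plus the previous path within the same visit.
conn.execute("""
    CREATE VIEW q AS
    SELECT unique_visit_id, hit_time, page_path,
           LAG(page_path, 1) OVER (
               PARTITION BY unique_visit_id ORDER BY hit_time
           ) AS lagged
    FROM hits
""")

# The subquery yields one row per visit that contains an error hit, labelled
# by the page that preceded the error. The LEFT OUTER JOIN copies that label
# onto every hit of the visit; visits without an error get NULL.
labelled = conn.execute("""
    SELECT t1.unique_visit_id, t1.hit_time, t1.page_path, t2.visit_label
    FROM q AS t1
    LEFT OUTER JOIN (
        SELECT unique_visit_id,
               MAX(CASE lagged
                       WHEN '/booking' THEN 'Booking error'
                       WHEN '/payment' THEN 'Payment error'
                   END) AS visit_label
        FROM q
        WHERE page_path = '/error'
        GROUP BY unique_visit_id
    ) AS t2
    ON t1.unique_visit_id = t2.unique_visit_id
    ORDER BY t1.unique_visit_id, t1.hit_time
""").fetchall()

for row in labelled:
    print(row)
```

Because the label lives on every row of the visit, a later GROUP BY on visit_label can aggregate bounces, hits and so on per error type.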



Answer 2:


If you want to avoid joins, you can use an aggregate function with OVER, something like:

MAX(IF((your condition here), your value here, NULL)) OVER (PARTITION BY your_partition)

Window functions used to have some performance issues, which should have been improved recently. My experience with BQ leads me to prefer Jordan's JOIN suggestion. But hey, it's a fun riddle...
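
As a rough illustration of the aggregate-with-OVER idea, the sketch below runs it in SQLite through Python's sqlite3 module rather than in BigQuery (SQLite 3.25 or later is needed for window functions). The table q, its columns, and the paths '/booking', '/payment' and '/error' are all hypothetical, and CASE with equality tests stands in for IF and REGEXP_MATCH, which SQLite does not provide in that form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE q (unique_visit_id TEXT, hit_time INTEGER, "
    "page_path TEXT, lagged TEXT)"
)
# Hypothetical rows: each hit already carries the previous path of its visit.
conn.executemany("INSERT INTO q VALUES (?, ?, ?, ?)", [
    ("A", 0, "/booking", None),
    ("A", 5, "/error", "/booking"),
    ("A", 9, "/home", "/error"),
    ("B", 0, "/payment", None),
    ("B", 4, "/error", "/payment"),
    ("C", 0, "/home", None),
    ("C", 3, "/booking", "/home"),
])

# The CASE yields a label only on the row where the condition holds and NULL
# elsewhere. MAX ignores NULLs, so taking it over the whole partition spreads
# the label to every row of the visit without a join.
LABEL_SQL = """
    SELECT unique_visit_id, hit_time, page_path,
           MAX(CASE
                   WHEN page_path = '/error' AND lagged = '/booking'
                       THEN 'Booking error'
                   WHEN page_path = '/error' AND lagged = '/payment'
                       THEN 'Payment error'
               END) OVER (PARTITION BY unique_visit_id) AS visit_label
    FROM q
"""

window_labelled = conn.execute(
    LABEL_SQL + " ORDER BY unique_visit_id, hit_time"
).fetchall()

# Count hits per label, the kind of roll-up the question wants to do next.
hits_per_label = dict(conn.execute(
    "SELECT visit_label, COUNT(*) FROM (" + LABEL_SQL + ") GROUP BY visit_label"
).fetchall())

for row in window_labelled:
    print(row)
print(hits_per_label)
```

If a visit could match more than one label, MAX would pick the alphabetically largest one, so a real query would need an explicit rule for that case.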



Source: https://stackoverflow.com/questions/24747744/using-a-case-statement-to-change-the-value-of-a-new-bigquery-column-based-findin
