问题
I have a large data set in Google Bigquery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data
would produce something like this:
date | session | eventAction | cd001 | cd002 | cd004 | cd005
-----|---------|-------------|-------|----------|----------|-------
1 | 1 | 'event_1' | '1' | 'Pizza' | null | '21'
1 | 1 | 'event_2' | '10' | '25' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | '3.1' | null | '15' | 'France'
1 | 2 | 'event_1' | '6' | 'Burger' | null | '21'
1 | 2 | 'event_2' | '21' | '25' | 'Burger' | '12.6'
The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:
date | session | eventAction | country | vendor | product | price
-----|---------|-------------|----------|----------|----------|-------
1 | 1 | 'event_1' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_2' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | 'France' | '1' | 'Pizza' | '14.56'
1 | 2 | 'event_1' | 'Spain' | '25' | 'Burger' | '12.6'
1 | 2 | 'event_2' | 'Spain' | '25' | 'Burger' | '12.6'
I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix
) that looks like this:
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_2' | 'cd020' | 'cd146' | 'cd056'
'event_3' | 'cd001' | 'cd020' | 'cd035'
and so on, where the value in each cell of column variable_#
is the column name in the table_with_dirty_data
where the information can be found. That is, variable_1
can be found in a column called cd020
for events with event_name 'event_1' and 'even_2', but in a column called cd001
for events with event_name 'event_3'. So basically what matrix
does is to map to which custom dimension each variable was sent for each event.
Now...I have hundreds of different events on table_with_dirty_data
, and matrix
holds all 200 GA custom dimensions, so doing something like
SELECT
CASE
WHEN event_name = 'event_1' THEN cd020
WHEN event_name = 'event_2' THEN cd020
WHEN event_name = 'event_3' THEN cd001
END AS variable_1
, CASE
WHEN event_name = 'event_1' THEN cd035
WHEN event_name = 'event_2' THEN cd146
WHEN event_name = 'event_3' THEN cd020
END AS variable_2
, CASE
WHEN event_name = 'event_1' THEN cd120
WHEN event_name = 'event_2' THEN cd056
WHEN event_name = 'event_3' THEN cd035
END AS variable_3
FROM table_with_dirty_data
would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT
statements that return the column name (cd###
) from the table_with_dirty_data
where the information is stored and the use WHILE
to loop over all different events. So, for example, with event_name
= 'event_1' it would be something like this:
SELECT
CASE WHEN event_name = 'event_1' THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1') END AS variable_1
, CASE WHEN event_name = 'event_1' THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1') END AS variable_2
, CASE WHEN event_name = 'event_1' THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1') END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
The goal here is that I can then loop over an array with all the event_name
s (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).
The issue is that the SELECT
statements I am using get evaluated to a string, so the result of the queries inside the CASE
clause are string literals. For example, if I were to run it for event = 'event_1',
SELECT variable_1 FROM matrix WHERE event_name = 'event_1'
evaluates to
'cd020'
which then cause the outer query to become
SELECT
CASE WHEN event_name = 'event_1' THEN 'cd020' END AS variable_1
, CASE WHEN event_name = 'event_1' THEN 'cd035' END AS variable_2
, CASE WHEN event_name = 'event_1' THEN 'cd120' END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
which produces a table like this
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
instead of returning the value stored in columns cd020
, cd035
or cd120
for each time the event 'event_1' is found.
Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020
)??
PS: I am also open to a totally different strategy if anyone knows one.
回答1:
Sounds like a case where EXECUTE IMMEDIATE might be helpful. I'm not sure I fully understand the mapping, but it should allow you to compose a dynamic statement based off your matrix
table.
However, it is unclear based on the description how unwieldy the generated query might be, as you mention having hundreds of these kinds of events and you may start running into other problems such as query complexity or length limits.
回答2:
I solved the problem using the EXECUTE IMMEDIATE
clause as recommended by @shollyman. I did it for just one of the events but I believe this answers the original question (scaling it to other events is just a matter of writing a WHILE
loop). I'll go step by step since I'm not using an actual repex.
Step one I declared a variable called event
with the name of the event that the query needs to look for.
DECLARE event STRING DEFAULT 'event_1';
Then I declared a variable that holds the column name where the query needs to look for the variable.
DECLARE variable_name STRING DEFAULT (SELECT variable_1 from matrix WHERE event_name = event);
Then I wrote the query as I would normally do, but using the EXECUTE IMMEDIATE
clause. I used triple double-quotes so that I could break it into several lines for readability).
EXECUTE IMMEDIATE CONCAT("""
SELECT
CASE WHEN event_name = '""", event, "' THEN ", variable_name, """ END AS variable_1
FROM table_with_dirty_data
WHERE event_name = '""", event, """'
""");
If others are going to use this, do take note of the lonely single quotes that I used before or after some triple double-quotes. I did that because, for example, the declared variable event
, even though it's a string, seems to get concatenated as event (no single quotes around it) and that breaks the query execution.
来源:https://stackoverflow.com/questions/62144073/how-to-convert-strings-into-column-names-in-google-bigquery