Flattening Google Analytics data (with repeated fields) not working anymore

后端 未结 5 1976
情话喂你
情话喂你 2020-12-16 16:54

We have a premium Google Analytics account, which will give us access to row level event data. This data is exported daily to Google Bigquery and for every day a new table w

相关标签:
5条回答
  • 2020-12-16 16:54

    This is actually the result of a bugfix I submitted last week, and is preventing you from getting incorrect results.

    BigQuery by default flattens all query results before returning them, but we only want to flatten one independently repeated field to avoid a cross-product expansion of data. The bug was that our checks for multiple repeated fields failed to take into account repeatedness of parent records in some cases, which caused us to fail to flatten some independently repeated fields. This meant that we could return flat rows where independently repeated values were actually "flattened" into dependently repeated values, instead of generating the cross-product, which is actually a wrong result.

    What you're seeing here is a result of the stricter check: you have (at least) two repeated fields in your output schema before we attempt to flatten the results.

    Another important thing to note is that the FLATTEN([table-value], [field]) function only flattens the repeatedness of the field you specify as the second argument. When you say flatten([xxxxxxxx.ga_sessions_20140829],hits), you are flattening only the "hits" record. If you also want to flatten its repeated children (product, promotion, etc.) you must explicitly add another flatten for those fields, like:

    FLATTEN(FLATTEN([xxxxxxxx.ga_sessions_20140829],hits),hits.product)

    --

    You have a couple options to make your example work:

    1) Select fewer fields. If you only care about getting flattened output of a few fields, you can remove the independently repeated fields from your query results by explicitly selecting only the fields you care about.

    2) Add more FLATTENs. You'll need to flatten on each repeated field, which looks to include at least hits, hits.product, and customDimensions. You may find that the error message will then complain about different repeated fields: add more FLATTENs on the repeated fields in your schema until it works.

    0 讨论(0)
  • 2020-12-16 16:55

    If you're using the BigQuery Web Console, select a Destination Table, click Allow Large Results and uncheck Flatten Results.

    If you're using bq command-line tool:

    bq query --allow_large_results --noflatten --destination_table NAME_OF_TABLE  "SELECT * from FLATTEN( [dataset], hits)"
    
    0 讨论(0)
  • 2020-12-16 17:02

    standardsql

    For those who want to import GA data to a relational database: Please be aware that a nested schema is actually putting multiple relational tables into one nested structure - they are equivalent, so plain flattening is not the best solution here: you want to separate the tables again! That way you save a lot of space in storage and you queries might get faster.

    You can think of GA data being 3 relational tables:

    • sessions table
    • hits table
    • products table

    To get sessions table you can use something like

    -- session table 
    SELECT
      * EXCEPT(hits,
        customDimensions),
      CONCAT('{',(
        SELECT
          STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
        FROM
          t.customdimensions),'}') as customDimensions
    FROM
      `project.dataset.ga_sessions_20171031` AS t
    

    customDimensions is aggregated to a json string.

    Similarly with the hits table:

    -- hits table
    SELECT
     fullvisitorid,
     visitid,
     visitstarttime,
     h.* EXCEPT(product,
     customdimensions, customMetrics, customVariables, promotion, experiment),
     CONCAT('{',(
       SELECT
         STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
       FROM
         h.customdimensions),'}') AS hitsCustomDimensions,
     CONCAT('{',(
       SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
       FROM h.custommetrics),'}') AS hitsCustomMetrics
    FROM
     `project.dataset.ga_sessions_20171031` AS t,
     t.hits AS h
    

    You can add promotion and experiment as I did with customDimensions using a sub-select.

    Join sessions with hits using a concatenation of fullvisitorid + visitstarttime/visitid (GA sessions include midnight split: use visitStartTime instead of visitid! If you want to ignore midnight split use visitid - it stays the same despite the split)

    For products you'd add the hit number to your 'session id' to get a unique identifier:

    -- product table
    SELECT
      fullvisitorid,
      visitid,
      visitstarttime,
      h.hitNumber,
      p.* EXCEPT(customdimensions, customMetrics),
      CONCAT('{',(
        SELECT STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
        FROM p.customdimensions),'}') AS productsCustomDimensions,
      CONCAT('{',(
        SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
        FROM p.custommetrics),'}') AS productsCustomMetrics
    FROM
      `project.dataset.ga_sessions_20171031` AS t,
      t.hits AS h, h.product as p
    

    Now you have 3 relational / "flat" tables you can join as needed in any relational db.

    0 讨论(0)
  • 2020-12-16 17:02

    I suspect the tables generated by Google Analytics Premium have been augmented with the addition of new columns. Adding columns shouldn't be a problem, unless your queries use the * selector, and one of the new columns happens to contain nested values.

    Recommended solution: Instead of using *, explicitly ask for the columns you need.

    0 讨论(0)
  • 2020-12-16 17:12

    You have to use flatten the query before connecting. Simply pick the fields that are mentioned in the error and use the nested flattening:

    SELECT * FROM 
    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140829],hits),customDimensions_value),hits_product_productSKU),
    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140828],hits),customDimensions_value),hits_product_productSKU),
    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140827],hits),customDimensions_value),hits_product_productSKU),    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140826],hits),customDimensions_value),hits_product_productSKU)
    
    0 讨论(0)
提交回复
热议问题