BigQuery GA Exported with Duplicated Rows

南笙酒味 提交于 2019-12-02 15:10:05

问题


We have been trying to explain why this happened in all of our datasets but so far we had no success.

We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:

SELECT
  fullvisitorid fv,
  visitid v,
  ARRAY(
  SELECT
    AS STRUCT hits.*
  FROM
    UNNEST(hits) hits
  ORDER BY
    hits.hitnumber) h
FROM
  `dafiti-analytics.40663402.ga_sessions*`
WHERE
  1 = 1
  AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
  fv,
  v
LIMIT
  100

And the result was:

We tried to investigate when this began to happen, so I ran this query:

SELECT
  date,
  f,
  COUNT(f) freq from(
  SELECT
    date,
    fullvisitorid fv,
    visitid v,
    COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
  FROM
    `dafiti-analytics.40663402.ga_sessions*`
  WHERE
    1 = 1
    AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
    AND TIMESTAMP('2017-04-30')
  GROUP BY
    fv,
    v,
    date )
GROUP BY
  f,
  date
ORDER BY
  date,
  freq DESC

And we found that for 3 of our projects it started on day 18 April but in accounts related to LATAM data we started seeing duplicated rows just recently as well.

We also checked if in our GCP Console something was logged but couldn't find anything.

Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking but it seems to be working just fine. Also there's no modification we did these days that explain it as well.

If you need more info please let me know,


回答1:


Make sure to match only the intraday or non-intraday tables. For intraday:

`dafiti-analytics.40663402.ga_sessions_intraday*`

For non-intraday:

`dafiti-analytics.40663402.ga_sessions_2017*`

The important part is to include enough of the prefix to match the desired tables.



来源:https://stackoverflow.com/questions/43942988/bigquery-ga-exported-with-duplicated-rows

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!