Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

Submitted by 谁都会走 on 2019-12-04 16:41:55

Can you run the same query in BigQuery, instead of on the data exported to Hive?

My guess is based on this part of the question: "The data is stored in a hive table. I flattened all the nested and repeated fields before exporting." When flattening, are you repeating the session-level pageviews value several times (once per flattened row), so that each repetition is added into the final, inflated sum?

Note how data can get duplicated when flattening rows:

SELECT col, x FROM (
  SELECT "wrong" col, SUM(totals.pageviews) x
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
  SELECT "correct" col, SUM(totals.pageviews) x
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)


col      x
wrong    2262
correct  249
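
To make the mechanism concrete outside of BigQuery, here is a minimal Python sketch. The sessions, field names, and values are made up for illustration and only mirror the shape of the export (one pageviews total per session, plus a repeated list of hits); it is not the actual Hive export logic from the question.

```python
# Hypothetical sessions: one pageviews total per session plus a repeated
# list of hits. The values are invented for illustration only.
sessions = [
    {"visitId": 1, "pageviews": 3, "hits": ["a", "b", "c"]},
    {"visitId": 2, "pageviews": 2, "hits": ["a", "b", "c", "d"]},
]

# Flattening on hits emits one row per hit and copies the session-level
# pageviews value onto each of those rows.
flattened = [
    {"visitId": s["visitId"], "pageviews": s["pageviews"], "hit": h}
    for s in sessions
    for h in s["hits"]
]

# Summing over sessions counts each total once.
correct_total = sum(s["pageviews"] for s in sessions)

# Summing over flattened rows counts each total once per hit.
wrong_total = sum(row["pageviews"] for row in flattened)

print("correct", correct_total)  # 3 + 2 = 5
print("wrong", wrong_total)      # 3*3 + 2*4 = 17
```

The inflated figure depends on how many hits each session has, which is why the "wrong" sum bears no fixed ratio to the "correct" one.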

Update, in response to "update 2" in the question:

Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.

Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:

SELECT SUM(pv)
FROM (
  SELECT visitId, MAX(totals.pageviews) pv
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
  GROUP EACH BY 1
)
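
The same idea, keeping one value per session and then summing, can be sketched in plain Python. The rows below are hypothetical (visitId, pageviews) pairs standing in for an already-flattened table; the names and numbers are assumptions for illustration.

```python
# Hypothetical flattened rows as (visitId, pageviews) pairs: the session
# total is repeated once per hit.
flattened = [
    (1, 3), (1, 3), (1, 3),
    (2, 2), (2, 2), (2, 2), (2, 2),
]

# Keep a single pageviews value per session. MAX works because every
# duplicated row of a session carries the same total.
per_session = {}
for visit_id, pageviews in flattened:
    per_session[visit_id] = max(per_session.get(visit_id, 0), pageviews)

# Summing the per-session values recovers the un-duplicated total.
deduplicated_total = sum(per_session.values())

print(deduplicated_total)  # 3 + 2 = 5
```

One caveat when applying this to real Google Analytics exports: visitId is only unique per visitor, so grouping on the combination of fullVisitorId and visitId is the safer key if several visitors may share a visitId.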