How to create a table in BigQuery using Python when the schema keeps changing?

Submitted by 不问归期 on 2020-01-07 03:53:32

Question


My data source is based on events happening in a third-party tool, e.g. customer.created, customer.updated, customer.plan.updated. Every event has a different JSON schema, and even the same event, e.g. customer.updated, may have a different schema from a previous customer.updated event.

I am planning to load this data into BigQuery, but it appears that BigQuery doesn't support dynamic schemas. I am building a data warehouse and want to store all customer-related events in the same table.

Is BigQuery the right tool for such data? Are there better options in GCP (Bigtable, Cloud Datastore, Cloud SQL, etc.) for this type of data? One requirement is that, if possible, the data can be queried easily by non-technical people who can write simple SELECT and JOIN queries.


Answer 1:


You'll get the best results in BigQuery when you can put your data in well-defined columns, but you will also get great results if you just store JSON objects as strings.

For example, see how https://www.githubarchive.org/ does it:

  • GitHub Archive stores many types of GitHub events. Most of them share the same set of properties, so we can store those in pre-defined columns.
  • Other GitHub event schemas differ by event type, and they also keep changing over time. Instead of dealing with schema changes, we just store them as JSON strings and parse them at query time.
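
For the customer events in the question, the same split could look like the sketch below. It is only an illustration: the field names `id`, `type`, and `created` are assumptions about what the third-party events have in common, and `event_to_row` is a hypothetical helper, not part of any library.

```python
import json

# Assumed common fields; adjust to whatever every event really carries.
COMMON_FIELDS = ("id", "type", "created")


def event_to_row(event: dict) -> dict:
    """Split an event into stable columns plus one JSON-string payload column."""
    row = {name: event.get(name) for name in COMMON_FIELDS}
    # Everything that varies between events is kept as a single string.
    rest = {k: v for k, v in event.items() if k not in COMMON_FIELDS}
    row["payload"] = json.dumps(rest, sort_keys=True)
    return row


# Two customer.updated events with different shapes (made-up sample data).
old = event_to_row({"id": "evt_1", "type": "customer.updated",
                    "created": "2017-07-11T10:00:00Z",
                    "email": "a@example.com"})
new = event_to_row({"id": "evt_2", "type": "customer.updated",
                    "created": "2017-07-12T10:00:00Z",
                    "plan": {"name": "pro"}})

# Both rows have the same keys, so they fit one fixed table schema.
print(sorted(old) == sorted(new))
```

Rows shaped like this could then be loaded into a table with four columns (for instance `id`, `type`, `created`, and `payload` as STRING), for example with `insert_rows_json` from the `google-cloud-bigquery` client. The table schema never has to change when the event schema does.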

#standardSQL
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS lang
  , COUNT(*) AS c
FROM `githubarchive.month.201612`
WHERE type = 'PullRequestEvent'
GROUP BY lang
HAVING lang IS NOT NULL
ORDER BY c DESC
LIMIT 10
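
To make the semantics of the query above concrete, here is a rough local equivalent in Python. It is a sketch under assumptions: `extract_scalar` is a hypothetical stand-in that only approximates `JSON_EXTRACT_SCALAR` for simple key paths, and the rows are hand-made sample data, not real GitHub Archive records.

```python
import json
from collections import Counter


def extract_scalar(payload: str, path: tuple):
    """Follow keys into a JSON string; return a scalar as text, else None."""
    node = json.loads(payload)
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None  # missing path behaves like SQL NULL
        node = node[key]
    if node is None or isinstance(node, (dict, list)):
        return None  # JSON null and non-scalars also yield NULL
    return node if isinstance(node, str) else json.dumps(node)


def pr_payload(language):
    """Build a made-up pull request payload with the given repo language."""
    return json.dumps({"pull_request": {"base": {"repo": {"language": language}}}})


rows = [
    {"type": "PullRequestEvent", "payload": pr_payload("Python")},
    {"type": "PullRequestEvent", "payload": pr_payload("Go")},
    {"type": "PullRequestEvent", "payload": pr_payload("Python")},
    {"type": "PullRequestEvent", "payload": pr_payload(None)},
    {"type": "PushEvent", "payload": json.dumps({"size": 1})},
]

path = ("pull_request", "base", "repo", "language")
counts = Counter(
    lang
    for row in rows
    if row["type"] == "PullRequestEvent"                            # WHERE
    for lang in [extract_scalar(row["payload"], path)]
    if lang is not None                                             # HAVING
)
top = counts.most_common(10)                                        # ORDER BY, LIMIT
print(top)
```

The point is that the payload column stays an opaque string in storage, and the structure is only interpreted when a query asks for a specific path.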



Source: https://stackoverflow.com/questions/45037415/how-to-create-a-table-in-bigquery-using-python-when-schema-keeps-changing
