Question
My data source is based on events happening in a 3rd-party tool, e.g. customer.created, customer.updated, customer.plan.updated. Every event has a different JSON schema, and even the same event type, e.g. customer.updated, may have a different schema from a previous customer.updated event.
I am planning to load this data into BigQuery, but it appears that BigQuery doesn't support dynamic schemas. I am building a data warehouse and want to store all customer-related events in the same table.
Is BigQuery the right tool for such data? Are there better options in GCP (Bigtable, Cloud Datastore, Cloud SQL, etc.) for this type of data? One requirement is that, if possible, the data can be queried easily by non-technical people running simple SELECT and JOIN queries.
Answer 1:
You'll get the best results in BigQuery when you can put your data in well-defined columns, but you will also get great results if you just store the JSON objects as strings.
For example, see how https://www.githubarchive.org/ does it:
- GitHub Archive stores many types of GitHub events. Most of them have the same set of properties - so we can store them in pre-defined columns.
- Some GitHub event schemas are different for each type of event, and they also keep changing over time. Instead of dealing with schema changes, we just store them as JSON strings and query them in real time.
#standardSQL
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
, COUNT(*) c
FROM `githubarchive.month.201612`
WHERE type='PullRequestEvent'
GROUP BY lang
HAVING lang IS NOT NULL
ORDER BY c DESC
LIMIT 10
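The same pattern can be sketched outside BigQuery with a minimal, hypothetical example: heterogeneous events are wrapped in a fixed envelope (event type plus payload stored as a JSON string), and individual fields are pulled out at query time, much like JSON_EXTRACT_SCALAR does in the query above. The event shapes and field names here are made up for illustration.

```python
import json

# Hypothetical events with differing schemas (illustrative only).
events = [
    {"type": "customer.created", "payload": {"id": 1, "plan": "free"}},
    {"type": "customer.updated", "payload": {"id": 1, "plan": "pro", "seats": 5}},
    {"type": "customer.updated", "payload": {"id": 2}},  # same type, different shape
]

# Fixed envelope: every row has the same two columns regardless of event schema,
# so a table definition never needs to change when payloads do.
rows = [{"type": e["type"], "payload": json.dumps(e["payload"])} for e in events]

def json_extract_scalar(payload, key):
    """Rough local analogue of BigQuery's JSON_EXTRACT_SCALAR for a top-level key."""
    value = json.loads(payload).get(key)
    return None if value is None else str(value)

# Count customer.updated events per plan, skipping rows without one,
# mirroring the WHERE / GROUP BY / HAVING clauses in the query above.
counts = {}
for row in rows:
    if row["type"] != "customer.updated":
        continue
    plan = json_extract_scalar(row["payload"], "plan")
    if plan is not None:
        counts[plan] = counts.get(plan, 0) + 1

print(counts)  # {'pro': 1}
```

Because the payload column is just a string, rows with brand-new fields load without any schema migration; the cost is moved to query time, where fields are extracted on demand.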
Source: https://stackoverflow.com/questions/45037415/how-to-create-a-table-in-bigquery-using-python-when-schema-keeps-changing