Question
I'm trying to upload JSON to BigQuery, with --autodetect
so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1209; errors: 1; max bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row starting at position 829980: No such field: mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
Answer 1:
I found two tools that can help:
- bigquery-schema-generator 0.5.1, which uses all the data to infer the schema instead of the 100 sample rows BigQuery uses (see the sketch after this list).
- Spark SQL, for which you'd need to set up a dev environment, or at least install Spark and invoke the spark-shell tool.
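As a rough sketch of the first option (assuming the package is installed from PyPI, the data is newline-delimited JSON in a local file data.json, and mydataset.mytable is a placeholder table name):

pip install bigquery-schema-generator

# scan every row of the file and write out a BigQuery schema file
generate-schema < data.json > schema.json

# load with the explicit schema instead of --autodetect
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable data.json ./schema.json

For the Spark route, spark.read.json("data.json").printSchema() in spark-shell likewise infers the schema from the whole file, though Spark's schema format would still need translating into BigQuery's.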
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not entirely sure such tools can work on a JSON file that is intended to fail.
Last but not least, I got the JSON imported after manually removing the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Answer 2:
Yes, see documentation here: https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not conform to the initial rows, you should not use autodetect and need to provide an explicit schema.
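For illustration, a minimal hand-written schema covering just the field from the error message might look like the JSON below; the types and modes here are guesses from the field path and the timestamp value, not derived from the actual data:

[
  {"name": "mc", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "marketDefinition", "type": "RECORD", "mode": "NULLABLE", "fields": [
      {"name": "settledTime", "type": "TIMESTAMP", "mode": "NULLABLE"}
    ]}
  ]}
]

Saved as schema.json, this is what you would pass to bq load in place of --autodetect (as in the sketch in Answer 1).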
Answer 3:
Autodetect may not work well since it looks only at the first 100 rows to detect the schema, and using schema detection for JSON can be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which lets you expand the schema cost-effectively?
You could simply use the following ingestion workflow with the CLI or serverless:
bqtail -r=rule.yaml -s=sourceURL
@rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See the "JSON with allow field addition" e2e test case.
Source: https://stackoverflow.com/questions/60936644/bigquery-autodetect-doesnt-work-with-inconsistent-json