BigQuery autodetect doesn't work with inconsistent json?

Submitted by 大兔子大兔子 on 2020-04-30 10:16:07

Question


I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.

Unfortunately I get the following failure:

Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.

Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3

Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?

Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?


Answer 1:


I found two tools that can help:

  • bigquery-schema-generator 0.5.1, which uses all of the data to derive the schema, instead of the 100 sample rows that BigQuery uses (see the sketch after this list).

  • Spark SQL, for which you would need to set up a dev environment, or at least install Spark and invoke the spark-shell tool.
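
For example, with bigquery-schema-generator the workflow might look like this (a minimal sketch; data.json, schema.json, and mydataset.mytable are hypothetical names):

pip install bigquery-schema-generator

# Scan the entire newline-delimited JSON file, not just a sample,
# and write a BigQuery schema file.
generate-schema < data.json > schema.json

# Load using the explicit schema file instead of --autodetect.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable data.json schema.json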

However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not entirely sure that such tools will work on a JSON file that is designed to fail.

Last but not least, I got the JSON imported after manually removing the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
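
If you would rather not edit the file by hand, a jq one-liner can strip that field from every row (a sketch, assuming newline-delimited JSON where mc is an array, as in the Betfair-style sample; the file names are hypothetical):

# Delete mc[*].marketDefinition.settledTime from each row; -c keeps
# the one-object-per-line format that bq load expects.
jq -c 'del(.mc[]?.marketDefinition.settledTime)' data.json > cleaned.json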

Hope this info helps.




Answer 2:


Yes, see the documentation here: https://cloud.google.com/bigquery/docs/schema-detect

When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.

So if the data in the remaining rows does not match the schema inferred from those initial rows, you should not use autodetect and instead need to provide an explicit schema.
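
For the field from the error message, an explicit schema file might look something like this (a sketch only; the modes and nesting are assumptions based on the error path mc.marketDefinition.settledTime, not the full schema of the sample data):

[
  {
    "name": "mc",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "marketDefinition",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
          {"name": "settledTime", "type": "TIMESTAMP", "mode": "NULLABLE"}
        ]
      }
    ]
  }
]

You would then pass this file to bq load as the schema argument instead of using --autodetect.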




Answer 3:


Autodetect may not work well, since it looks only at the first 100 rows to detect the schema. Getting schema detection right for JSON can be a costly endeavor.

How about using BqTail with the AllowFieldAddition option, which allows you to expand the schema cost-effectively?

You could simply use the following ingestion workflow, either with the CLI or in serverless mode:

bqtail -r=rule.yaml -s=sourceURL

@rule.yaml

When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false

Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp

Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete

See the "JSON with allow field addition" end-to-end (e2e) test case in the BqTail repository for a complete example.
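
For comparison, plain bq can also append with schema relaxation when the table already exists (a sketch; the table and file names are hypothetical):

# Append to an existing table, allowing the load job to add new
# NULLABLE fields that appear in later files.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect \
    --schema_update_option=ALLOW_FIELD_ADDITION \
    mydataset.mytable data.json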



Source: https://stackoverflow.com/questions/60936644/bigquery-autodetect-doesnt-work-with-inconsistent-json
