Question
I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.
Currently my bq load job is failing with an unhelpful error message:
UDF worker timed out during execution.; Unexpected abort triggered for worker avro-worker-156907: request_timeout
I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of these formats.
However, I was wondering whether anyone here has experience with which of these formats is the most reliable / least prone to quirks in practice when loading into BigQuery?
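For context, the failing load job is roughly of this shape (the dataset, table, and bucket names below are placeholders, not the real ones):
bq load \
  --source_format=AVRO \
  my_dataset.my_table \
  gs://my-bucket/big-table.avro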
Answer 1:
I would probably solve this by following these steps:
- Create many small files in CSV format (see the sketch after this list).
- Upload the files to GCS.
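For the first step, a minimal sketch of how the splitting could be done with GNU split (the input file name and the 1 GB chunk size are assumptions, not part of the original answer):
# split one large CSV into ~1 GB pieces, cutting only at line boundaries
split -C 1G --additional-suffix=.csv big-table.csv chunk_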
Command to copy the files to GCS:
gsutil -m cp <local folder>/* gs://<bucket name>
The gsutil -m option performs the copy in parallel (multi-threaded/multi-processing).
After that, I would move the data from GCS to BigQuery using the default Cloud Dataflow template (GCS_Text_to_BigQuery). (Remember that with a default template you don't need to write any code.)
Here is an example of how to invoke the Dataflow template:
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
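For concreteness, a hypothetical invocation with the placeholders filled in might look like this (every job, bucket, path, and table name below is made up for illustration):
gcloud dataflow jobs run avro-reload-job \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://my-bucket/schema/table-schema.json,\
javascriptTextTransformGcsPath=gs://my-bucket/udf/transform.js,\
inputFilePattern=gs://my-bucket/chunks/*.csv,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp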
Source: https://stackoverflow.com/questions/55463433/most-reliable-format-for-large-bigquery-load-jobs