Question
I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.
Currently my bq load job is failing with an unhelpful error message:
UDF worker timed out during execution.; Unexpected abort triggered for worker avro-worker-156907: request_timeout
I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of these formats.
However, I was wondering whether anyone here has experience with which of these formats is the most reliable / least prone to quirks in practice when loading into BigQuery?
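For context, the failing load job is roughly of this shape (the dataset, table, and bucket names below are placeholders, not the real ones):
bq load \
  --source_format=AVRO \
  my_dataset.my_table \
  gs://my-bucket/big-table.avro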
Answer 1:
I would probably solve this by following these steps:
- Create many small files in CSV format (see the sketch after this list).
- Upload the files to GCS.
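For the first step, a minimal sketch of how the splitting could be done with GNU split (the input file name and the 1 GB chunk size are assumptions, not part of the original answer):
# split one large CSV into ~1 GB pieces, cutting only at line boundaries
split -C 1G --additional-suffix=.csv big-table.csv chunk_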
Command to copy the files to GCS:
gsutil -m cp <local folder>/* gs://<bucket name>
The gsutil -m option performs the copy in parallel (multi-threaded/multi-processing).
After that, I would move the data from GCS to BigQuery using the default Cloud Dataflow template (GCS_Text_to_BigQuery). (Remember that with a default template you don't need to write any code.)
Here is an example of how to invoke the Dataflow template:
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
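For concreteness, a hypothetical invocation with the placeholders filled in might look like this (every job, bucket, path, and table name below is made up for illustration):
gcloud dataflow jobs run avro-reload-job \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://my-bucket/schema/table-schema.json,\
javascriptTextTransformGcsPath=gs://my-bucket/udf/transform.js,\
inputFilePattern=gs://my-bucket/chunks/*.csv,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp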
Source: https://stackoverflow.com/questions/55463433/most-reliable-format-for-large-bigquery-load-jobs