Most reliable format for large BigQuery load jobs


Question


I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.

Currently my bq load job is failing with an unhelpful error message:

UDF worker timed out during execution.; Unexpected abort triggered for
worker avro-worker-156907: request_timeout

I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of these formats.

However, I was wondering whether anyone here might have experience with which of these formats is most reliable / least prone to quirks in practice when loading into BigQuery?
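
For reference, a plain bq CLI load of a single Avro file looks roughly like this (dataset, table, and bucket names below are placeholders, not my real ones):

bq load \
    --source_format=AVRO \
    mydataset.mytable \
    gs://my-bucket/my-table.avro

Switching formats only changes --source_format (e.g. PARQUET, or CSV, which additionally needs a schema or --autodetect).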


Answer 1:


I would probably solve this with the following steps:

  1. Create a large number of small files in CSV format (a sketch follows this list).
  2. Upload the files to GCS.
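
As a sketch of step 1, one way to chunk a large CSV export into many smaller files is GNU split (the file and prefix names here are just examples):

split -d -l 1000000 --additional-suffix=.csv my-table.csv chunk_

This writes chunk_00.csv, chunk_01.csv, ... with one million rows each; if the export has a header row, strip it beforehand so each chunk contains data rows only.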

Command to copy files to GCS:

gsutil -m cp <local folder>/* gs://<bucket name>

The -m option tells gsutil to perform the copy in parallel (multi-threaded/multi-processing).

After that, I would move the data from GCS to BigQuery using the default Cloud Dataflow template GCS_Text_to_BigQuery. (Remember that with a default template you don't need to write any pipeline code.)

Here is an example of invoking the Dataflow template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
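
Once the Dataflow job finishes, a quick sanity check on the destination table (placeholder name again) is:

bq show --format=prettyjson mydataset.mytable

which prints the table metadata, including numRows and numBytes.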


Source: https://stackoverflow.com/questions/55463433/most-reliable-format-for-large-bigquery-load-jobs
