Confusion when uploading a JSON from Google Cloud Storage to BigQuery


Question


Hello, this is a two-part question.

1) Currently I am trying to upload a file from Google Cloud Storage to BigQuery via a Python script. I am trying to follow the steps given in Google's documentation:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#bigquery-import-gcs-file-python

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

I am not sure what to put in for the first line of "load_data_from_gcs" because in Google Cloud Storage there are no tables; it is a JSON file I am trying to upload. Would the "table" part be the name of the table I am trying to create, or is it talking about the bucket? There is no part to specify which bucket I want to pull from.

This is the code I have so far.

import json
import argparse
import time
import uuid

from google.cloud import bigquery
# from google.cloud import storage

def load_data_from_gcs('dataworks-356fa', table_name, 'pullnupload.json'):
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

2) I want this script to run weekly and be able to either delete the old table and create a new one, or only load in the non-duplicated data, whichever is easier.

Thank you for your help.


Answer 1:


Not sure what problem you are having, but loading data from a file in GCS into BigQuery works exactly the way you are already doing it.

If you have a table with this schema:

[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]

And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):

{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}

You'd now just need to set up the job object to process a JSON file as input, like so:

def load_data_from_gcs(table_name, source='gs://bucket/json_data.json'):
    # Project and dataset are hard-coded to the asker's values here.
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    # "source" is the gs:// URI of the file to load from Cloud Storage.
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    # Tell BigQuery the input is newline-delimited JSON, then start the job.
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()

And that's it.
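
Both snippets call wait_for_job, which isn't defined in them. A minimal polling helper, roughly along the lines of the one in Google's sample for this older client library (treat the details as a sketch), could look like:

def wait_for_job(job):
    while True:
        job.reload()  # refresh the job state with a GET request
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)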

(If you have a CSV file, you would configure your job object accordingly; see the sketch below.)
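
A rough sketch for the CSV case, assuming the same client version (source_format and skip_leading_rows are load-job settings; the function name load_csv_from_gcs is made up for illustration):

def load_csv_from_gcs(table_name, source):
    # Same setup as above, but configured for CSV input.
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job = bigquery_client.load_table_from_storage(
        str(uuid.uuid4()), table, source)

    job.source_format = 'CSV'     # input file is comma-separated
    job.skip_leading_rows = 1     # skip the header row, if there is one
    job.begin()
    wait_for_job(job)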

As for the second question, it's really a matter of trying out different approaches and seeing which works best for you.

To delete a table, you'd just need to run:

table.delete()

To remove duplicated data from a table, one possibility would be to write a query that removes the duplication and saves the results back to the same table. Something like:

query_job = bigquery_client.run_async_query(query=your_query, job_name=job_name)
query_job.destination = table  # must be a Table object; reusing the same table overwrites it
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()
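
For reference, a hypothetical your_query for the case where rows are exact duplicates could be a plain SELECT DISTINCT in standard SQL (the table reference below is a placeholder, and use_legacy_sql is set because this client defaults to legacy SQL):

# Placeholder query: keep only distinct rows (standard SQL syntax).
your_query = """
    SELECT DISTINCT *
    FROM `dataworks-356fa.FirebaseArchive.your_table`
"""

query_job = bigquery_client.run_async_query(query=your_query,
                                            job_name=str(uuid.uuid4()))
query_job.use_legacy_sql = False  # backtick table names need standard SQL
query_job.destination = dataset.table('your_table')
query_job.write_disposition = 'WRITE_TRUNCATE'  # overwrite with the deduplicated rows
query_job.begin()
wait_for_job(query_job)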


Source: https://stackoverflow.com/questions/44876686/confusion-when-uploading-a-json-from-googlecloud-storage-to-bigquery
