Efficiently write a Pandas dataframe to Google BigQuery

不问归期 提交于 2019-12-03 00:07:33

I did the comparison for alternative 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
my_data = [{1,2,3}]
for i in range(0,100000):
    my_data.append({1,2,3})
not_so_simple_dataframe = pd.DataFrame(data=my_data,columns=['a','b','c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

and here are the results for n = {10000,100000,1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!