Question
I have a legacy unpartitioned BigQuery table that streams logs from various sources (let's say table BigOldA). The aim is to transfer it to a new day-partitioned table (let's say PartByDay), which was done with the help of the following link:
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
bq query \
  --allow_large_results \
  --replace=true \
  --destination_table <project>:<data-set>.<PartByDay> \
  --time_partitioning_field REQUEST_DATETIME \
  --use_legacy_sql=false 'SELECT * FROM `<project>.<data-set>.<BigOldA>`'
I have migrated the historical data to the new table, but I cannot delete it from table BigOldA, because I run into the known limitation that DML statements over tables with a streaming buffer are not supported yet:
Error: UPDATE or DELETE DML statements are not supported over
table <project>:<data-set>.BigOldA with streaming buffer
I was planning to run batch jobs every day that transfer T-1 data from table BigOldA to table PartByDay and delete it periodically, so that I can still keep the streaming buffer data in BigOldA and start using the PartByDay table for analytics. Now I am not sure whether this is achievable.
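For illustration, the transfer half of that plan might look like the sketch below (a hedged example reusing the names above; --append_table instead of --replace, GNU date assumed). The matching DELETE on BigOldA would be exactly the DML that the streaming buffer blocks.
# Hypothetical daily T-1 transfer; the companion DELETE is the part that fails.
yesterday=$(date -d "yesterday" +%Y-%m-%d)   # GNU date syntax
bq query --use_legacy_sql=false --append_table \
  --destination_table '<project>:<data-set>.PartByDay' \
  "SELECT * FROM \`<project>.<data-set>.BigOldA\` WHERE DATE(REQUEST_DATETIME) = '$yesterday'"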
I am looking for an alternative solution or best practice for periodically transferring data from a table with a streaming buffer into a partitioned table. Also, since the data streams in from independent production sources, it is not possible to point all of them at PartByDay, and the streamingBuffer property from tables.get is never null.
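For reference, the streamingBuffer state comes from tables.get; a quick way to inspect it from the CLI (the grep is just a convenience and assumes the prettyjson layout) is:
bq show --format=prettyjson '<project>:<data-set>.BigOldA' | grep -A 3 streamingBuffer
An active buffer reports estimatedRows and estimatedBytes; the field disappears once the buffer has drained.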
Answer 1:
You could just delete the original table and then rename the migrated table to the original name after you've run your history job. This assumes your streaming component to BigQuery is fault tolerant: if it's designed well, you shouldn't lose any data, because whatever is streaming to BigQuery should be able to hold events until the table comes back online. Once the table is partitioned, nothing changes for the components that stream to it.
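A minimal sketch of that swap with the bq CLI (a rename here is effectively copy-then-delete; the gap between the two commands is the window your streaming sources must buffer through):
# Hypothetical swap; assumes the history job already populated PartByDay.
bq rm -f -t '<project>:<data-set>.BigOldA'
bq cp '<project>:<data-set>.PartByDay' '<project>:<data-set>.BigOldA'  # the copy keeps the partitioning spec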
Answer 2:
If anyone is interested in the script, here you go.
#!/bin/sh
# This script:
# 1. copies the data into a new partitioned table
# 2. deletes the unpartitioned source table
# 3. copies the partitioned table back to the original table name
# TODO 4. deletes the intermediate partitioned copy
set -e
source_project="<source-project>"
source_dataset="<source-dataset>"
source_table="<source-table-to-partition>"
destination_project="<destination-project>"
destination_dataset="<destination-dataset>"
partition_field="<timestamp-partition-field>"
destination_table="<table-copy-partition>"
source_path="$source_project.$source_dataset.$source_table"
source_l_path="$source_project:$source_dataset.$source_table"
destination_path="$destination_project:$destination_dataset.$destination_table"
echo "copying table from $source_path to $destination_path"
query=$(cat <<-END
SELECT * FROM \`$source_path\`
END
)
echo "deleting old table"
bq rm -f -t $destination_path
echo "running the query: $query"
bq query --quiet=true --use_legacy_sql=false --apilog=stderr --allow_large_results --replace=true --destination_table $destination_path --time_partitioning_field $partition_field "$query"
echo "removing the original table: $source_path"
bq rm -f -t $source_l_path
echo "table deleted"
echo "copying the partition table to the original source path"
bq cp -f -n $destination_path $source_l_path
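The TODO in step 4 could presumably be finished with one more bq rm, once the copy back to the source path has succeeded and the intermediate table is no longer needed:
echo "removing the intermediate table: $destination_path"
bq rm -f -t "$destination_path"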
Source: https://stackoverflow.com/questions/49754400/migrating-non-partitioned-streaming-table-to-partitioned-table-bigquery