Question
I have hourly batch jobs that need to scan all the data that has streamed into my table in the last hour. Right now I'm using a date-partitioned table, which means that every time I scan a date partition for an hour's worth of data, I have to scan rows from all hours of that day.
I've been thinking about clustering this table on an hour field; however, I'm under the impression that BigQuery won't actually keep the table effectively clustered in the face of streaming inserts. So here's my question:
Does BigQuery guarantee to keep clustered tables sorted even in the face of streaming inserts?
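For concreteness, here is a minimal sketch of the setup being described: a day-partitioned table clustered on an hour field, plus the query an hourly batch job might issue for the previous hour. The table name, the column names (`event_ts`, `event_hour`, `payload`) and the helper functions are all hypothetical, since the question gives no schema, and the code assumes timestamps are handled in UTC. It only builds SQL strings; running them would need a BigQuery client.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Hypothetical names: the question does not give a table or column schema.
TABLE = "my_project.my_dataset.events"
TS_COLUMN = "event_ts"      # TIMESTAMP column used for daily partitioning
HOUR_COLUMN = "event_hour"  # proposed clustering column (hour of day, 0-23)


def create_table_ddl() -> str:
    """Build DDL for a table partitioned by day and clustered on the hour field."""
    return (
        f"CREATE TABLE `{TABLE}` (\n"
        f"  {TS_COLUMN} TIMESTAMP,\n"
        f"  {HOUR_COLUMN} INT64,\n"
        f"  payload STRING\n"
        f")\n"
        f"PARTITION BY DATE({TS_COLUMN})\n"
        f"CLUSTER BY {HOUR_COLUMN}"
    )


def previous_hour_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) of the last complete hour before `now` (assumed UTC)."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def hourly_scan_query(now: datetime) -> str:
    """Build the query an hourly batch job would run for the previous hour."""
    start, end = previous_hour_window(now)
    return (
        f"SELECT * FROM `{TABLE}`\n"
        # Partition filter: limits the scan to one daily partition.
        f"WHERE DATE({TS_COLUMN}) = DATE '{start:%Y-%m-%d}'\n"
        # Cluster filter: can only skip data if the partition is actually sorted.
        f"  AND {HOUR_COLUMN} = {start.hour}\n"
        f"  AND {TS_COLUMN} >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}+00'\n"
        f"  AND {TS_COLUMN} < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}+00'"
    )


if __name__ == "__main__":
    print(create_table_ddl())
    print(hourly_scan_query(datetime.now(timezone.utc)))
```

Without clustering, only the `DATE(...)` predicate reduces the scan, so the job reads the whole day. The `event_hour` predicate helps only to the extent that the rows in that partition are physically sorted by hour, which is exactly what the question is asking about.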
Answer 1:
Currently the answer is no: clustered tables do not remain sorted/clustered in the face of streaming inserts. Many thanks to Tamir for pointing out a related answer that is relevant to this question. Check that answer out for details, as well as a trick to force sorting on part of a partition.
It also looks like the BigQuery team is working on this. According to this issue tracker comment from April 17, 2019:
We are doing some a fair amount of work with streaming to keep the table clustered upto a certain recent time interval. We don't have a good ETA to offer on this at this point, but we hope to have more information on this soon.
Source: https://stackoverflow.com/questions/55723409/bigquery-do-clustered-tables-remain-sorted-in-the-face-of-streaming-inserts