We have a use case with hundreds of millions of entries in a table and a problem splitting it up further. 99% of operations are append-only. However, we have oc
There is a relatively simple option that we have found efficient in similar scenarios with BigQuery.
It lets you query any time-based snapshot, as well as the current snapshot.
In short, the idea is to have one master table plus daily history tables.
During the day, the current daily table receives all changes (new, update, delete). A daily process then merges the last completed daily table into the master table, writing the result back to the same master table. Of course, a backup is taken first by copying the latest master table (a free operation).
This daily update keeps the master table clean and fresh as of the last day.
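The daily consolidation step could be sketched in BigQuery Standard SQL roughly as below. All table and column names (`my_dataset.master`, day-sharded `daily_YYYYMMDD` tables, an `id` key, an `op` column marking the change type, and a `ts` timestamp) are illustrative assumptions, not part of the original setup:

```sql
-- Merge yesterday's completed daily table into the master table.
-- Assumes rows keyed on `id`, with `op` in ('new', 'update', 'delete').
MERGE my_dataset.master AS m
USING my_dataset.daily_20240101 AS d
ON m.id = d.id
WHEN MATCHED AND d.op = 'delete' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET value = d.value, updated_at = d.ts
WHEN NOT MATCHED AND d.op != 'delete' THEN
  INSERT (id, value, updated_at) VALUES (d.id, d.value, d.ts)
```

The preceding `CREATE TABLE ... COPY` (or a `bq cp`) of the master table gives you the free backup before the merge runs.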
Now at any given moment you can get the most recent data by querying only the (junk-free) master table plus today's table.
At the same time, since you keep all the daily tables, you can query any historical data.
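The current-snapshot query could then be a union of the master table and today's deltas, resolved with a window function. Again, all names here are hypothetical placeholders:

```sql
-- Most recent state: master (as of last night) plus today's changes,
-- keeping only the latest version of each id and dropping deletions.
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
  FROM (
    SELECT id, value, updated_at AS ts, 'new' AS op FROM my_dataset.master
    UNION ALL
    SELECT id, value, ts, op FROM my_dataset.daily_20240102
  )
)
WHERE rn = 1 AND op != 'delete'
```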
Of course, the classic option of appending all changes (new, update, delete) into the master table with respective qualifiers still looks good both price- and performance-wise, because 99% of your data consists of new entries!
In your case, I would personally vote for the classic approach, with periodic cleaning of historical entries.
Finally, to my mind, this is less about joining and more about a union, using a table wildcard and window functions.
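For the classic append-only variant, the "latest version wins" resolution can be done at query time over all day shards with a wildcard table. This is a sketch under the same assumed schema (`id`, `value`, `ts`, `op`), using the Standard SQL `_TABLE_SUFFIX` pseudo-column:

```sql
-- Latest version of each row across all daily shards,
-- as of an arbitrary historical date chosen via the suffix filter.
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
  FROM `my_dataset.daily_*`
  WHERE _TABLE_SUFFIX <= '20240102'  -- any time-based snapshot
)
WHERE rn = 1 AND op != 'delete'
```

Dropping the `_TABLE_SUFFIX` filter gives the current snapshot; tightening it gives any point-in-time view, which is what makes this layout handle both query styles from the same tables.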