Best way to create randomly assigned partitions in Google BigQuery

Submitted by 心不动则不痛 on 2021-02-17 06:27:08

Question


I have a BigQuery table that is not randomly sorted. The IDs are also not random. I would like to partition the data into chunks based on a random number, so that I can use those chunks for various parts of the project.

The solution I have in mind is to add two columns to my table: a randomly generated number, and a partition number. I am following this code snippet on AI Platform Notebooks.
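For context, the job configuration presumably looks along these lines (a minimal sketch; the project, dataset, and table names are placeholders, not from the original notebook):

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination table; setting this is what later conflicts with running a script.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_table_with_split"
)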

The only substantive difference is that I've changed the query_job line to:

traintestsplit="""
DECLARE randn NUMERIC; 
DECLARE split INT64 default 0; 
LOOP
  SET randn = RAND();  
  IF (randn < (1/3)) THEN
    SET split = 1;
  END IF; 
  IF (randn > (2/3)) THEN 
    SET split = 3;
  ELSE
    SET split = 2; 
  END IF;
END LOOP; 
"""

query_job = client.query(traintestsplit,
    job_config=job_config,
)  # Make an API request.
query_job.result()  # Wait for the job to complete.

I get the same error someone else reported, BadRequest: 400 configuration.query.destinationTable cannot be set for scripts (job ID: 676675d7-9151-4626-8a7e-96263232f7b2). I have read through Cannot set destination table with BigQuery Python API, but I need split assignments that stay constant if I am going to reuse these partitions.

Should I approach this problem in another way? A very naive way would be to pull the IDs from the BigQuery table, generate a random number for each ID, save those numbers as a CSV, and then do a join every time I pull the data, but that seems terribly inefficient.
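For concreteness, that naive approach would look something like the sketch below (assuming an `id` column; the table and file names are placeholders):

import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
# Pull the IDs once and assign each one a split of 1, 2, or 3 at random.
ids = client.query(
    "SELECT id FROM `my-project.my_dataset.my_table`"
).to_dataframe()
ids["split"] = np.random.randint(1, 4, size=len(ids))
# Persist the assignment so it stays constant, then join it back on `id` for every later pull.
ids.to_csv("splits.csv", index=False)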

Source: https://stackoverflow.com/questions/65590829/best-way-to-create-randomly-assigned-partitions-in-google-bigquery
