Redshift: Update or Insert each row in column with random data from another table


Question


update testdata.dataset1
   set abcd = (select abc 
               from dataset2
               order by random()
               limit 1
              ) 

Doing this populates a single random entry from dataset2 into every row of dataset1: the subquery is evaluated only once and its result is reused for all rows.

What I need is for each row of dataset1 to get its own random entry from dataset2.

Note: dataset1 can have more rows than dataset2.
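
For reference, a minimal schema matching the queries in this post might look like this (a sketch; the column types are assumptions, since the question does not show its DDL):

-- Hypothetical schema for the examples below; types are assumptions.
CREATE TABLE dataset1 (
    id   INT,          -- integer row id; needed from Query 3 onward
    abcd VARCHAR(100)  -- column to fill with values from dataset2
);

CREATE TABLE dataset2 (
    id  INT,           -- assumed contiguous 1..N for Query 4 and Query 5
    abc VARCHAR(100)   -- source of the random values
);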


Answer 1:


Query 1

You should reference the outer column abcd inside your subquery to make it correlated; this prevents the planner from "optimizing" the subquery into a single evaluation shared by all rows.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                -- no-op predicate referencing the outer table's abcd;
                -- it correlates the subquery so it runs once per row
                -- (note: rows where abcd IS NULL are left unchanged,
                -- because NULL = NULL is not true)
                WHERE abcd = abcd
                ORDER BY random()
                LIMIT 1
               );


Query 2

The query below should be faster on plain PostgreSQL: jumping to a random OFFSET avoids sorting all of dataset2 by random() for every row.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                WHERE abcd = abcd  -- same correlation trick as above
                -- skip to a random position instead of sorting the table
                OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
                LIMIT 1
               );


However, as you have reported, this is not the case on Redshift, which uses columnar storage.

Query 3

Fetching all the records from dataset2 in a single query would be more efficient than fetching records one by one. Let's test:

UPDATE dataset1 original
SET abcd = fake.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
      FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;


Note that an integer id column must exist in dataset1.
Also, because of the modulo mapping, when dataset1 has more rows than dataset2 the assignment pattern repeats, so the abcd values for ids beyond the record count of dataset2 are predictable rather than independently random.
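
If dataset1 has no integer id, a simple way to get one is a deep copy with ROW_NUMBER(). A sketch (dataset1_numbered is a hypothetical name):

-- Sketch: rebuild dataset1 with a sequential integer id.
-- dataset1_numbered is a hypothetical name; use it in place of
-- dataset1 in the queries above.
CREATE TABLE dataset1_numbered AS
SELECT ROW_NUMBER() OVER () AS id, abcd
FROM dataset1;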

Query 4

Let's add an integer fake_id column to dataset1, prefill it with random values, and join on dataset1.fake_id = dataset2.id:

-- assumes fake_id was added first, e.g.:
-- ALTER TABLE dataset1 ADD COLUMN fake_id INT;
UPDATE dataset1
SET fake_id = floor(random()*(SELECT COUNT(*) FROM dataset2)) + 1;

UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
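
Note that Query 4 assumes dataset2.id runs contiguously from 1 to COUNT(*); if there are gaps, some fake_ids match nothing and those rows keep their old abcd. A sketch to build a gap-free copy (dataset2_numbered is a hypothetical name):

-- Sketch: renumber dataset2 so ids are gap-free 1..N.
CREATE TABLE dataset2_numbered AS
SELECT ROW_NUMBER() OVER () AS id, abc
FROM dataset2;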


Query 5

If you don't want to add a fake_id column to dataset1, you can calculate the fake_ids "on the fly":

UPDATE dataset1
SET abcd = abc
FROM (SELECT with_fake_id.id, dataset2.abc
      FROM (SELECT dataset1.id,
                   floor(random()*(SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
            FROM dataset1) AS with_fake_id
      JOIN dataset2 ON with_fake_id.fake_id = dataset2.id) AS joined
WHERE dataset1.id = joined.id;
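
Whichever variant you use, a quick sanity check (my addition, not part of the original answer) is to look at how often each value was assigned; one value covering every row would reproduce the original problem:

-- Sketch: inspect the spread of assigned values.
SELECT abcd, COUNT(*) AS times_assigned
FROM dataset1
GROUP BY abcd
ORDER BY times_assigned DESC;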



Performance

On plain PostgreSQL, Query 4 seems to be the most efficient.
I'll try to compare performance on a trial DC1.Large Redshift instance.



Source: https://stackoverflow.com/questions/45535038/redshift-update-or-insert-each-row-in-column-with-random-data-from-another-tabl
