Redshift: Update or Insert each row in column with random data from another table


Question


update testdata.dataset1
   set abcd = (select abc 
               from dataset2
               order by random()
               limit 1
              ) 

Doing this populates a single random entry from dataset2 into every row of dataset1: the subquery is evaluated only once and its result is reused for all rows.

What I need is for each row of dataset1 to get its own random entry from dataset2.

Note: dataset1 can have more rows than dataset2.
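
For reference, a minimal schema matching the queries in this post might look like this (a sketch; the column types are assumptions, since the question does not show its DDL):

-- Hypothetical schema for the examples below; types are assumptions.
CREATE TABLE dataset1 (
    id   INT,          -- integer row id; needed from Query 3 onward
    abcd VARCHAR(100)  -- column to fill with values from dataset2
);

CREATE TABLE dataset2 (
    id  INT,           -- assumed contiguous 1..N for Query 4 and Query 5
    abc VARCHAR(100)   -- source of the random values
);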


Answer 1:


Query 1

You should reference the outer column abcd inside your subquery to make it correlated; this prevents the planner from "optimizing" the subquery into a single evaluation shared by all rows.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                -- no-op predicate referencing the outer table's abcd;
                -- it correlates the subquery so it runs once per row
                -- (note: rows where abcd IS NULL are left unchanged,
                -- because NULL = NULL is not true)
                WHERE abcd = abcd
                ORDER BY random()
                LIMIT 1
               );


Query 2

The query below should be faster on plain PostgreSQL: jumping to a random OFFSET avoids sorting all of dataset2 by random() for every row.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                WHERE abcd = abcd  -- same correlation trick as above
                -- skip to a random position instead of sorting the table
                OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
                LIMIT 1
               );


However, as you have reported, this is not the case on Redshift, which uses columnar storage.

Query 3

Fetching all the records from dataset2 in a single query would be more efficient than fetching records one by one. Let's test:

UPDATE dataset1 original
SET abcd = fake.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
      FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;


Note that an integer id column must exist in dataset1.
Also, because of the modulo mapping, when dataset1 has more rows than dataset2 the assignment pattern repeats, so the abcd values for ids beyond the record count of dataset2 are predictable rather than independently random.
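
If dataset1 has no integer id, a simple way to get one is a deep copy with ROW_NUMBER(). A sketch (dataset1_numbered is a hypothetical name):

-- Sketch: rebuild dataset1 with a sequential integer id.
-- dataset1_numbered is a hypothetical name; use it in place of
-- dataset1 in the queries above.
CREATE TABLE dataset1_numbered AS
SELECT ROW_NUMBER() OVER () AS id, abcd
FROM dataset1;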

Query 4

Let's add an integer fake_id column to dataset1, prefill it with random values, and join on dataset1.fake_id = dataset2.id:

-- assumes fake_id was added first, e.g.:
-- ALTER TABLE dataset1 ADD COLUMN fake_id INT;
UPDATE dataset1
SET fake_id = floor(random()*(SELECT COUNT(*) FROM dataset2)) + 1;

UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
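
Note that Query 4 assumes dataset2.id runs contiguously from 1 to COUNT(*); if there are gaps, some fake_ids match nothing and those rows keep their old abcd. A sketch to build a gap-free copy (dataset2_numbered is a hypothetical name):

-- Sketch: renumber dataset2 so ids are gap-free 1..N.
CREATE TABLE dataset2_numbered AS
SELECT ROW_NUMBER() OVER () AS id, abc
FROM dataset2;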


Query 5

If you don't want to add a fake_id column to dataset1, you can calculate the fake_ids "on the fly":

UPDATE dataset1
SET abcd = abc
FROM (SELECT with_fake_id.id, dataset2.abc
      FROM (SELECT dataset1.id,
                   floor(random()*(SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
            FROM dataset1) AS with_fake_id
      JOIN dataset2 ON with_fake_id.fake_id = dataset2.id) AS joined
WHERE dataset1.id = joined.id;
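
Whichever variant you use, a quick sanity check (my addition, not part of the original answer) is to look at how often each value was assigned; one value covering every row would reproduce the original problem:

-- Sketch: inspect the spread of assigned values.
SELECT abcd, COUNT(*) AS times_assigned
FROM dataset1
GROUP BY abcd
ORDER BY times_assigned DESC;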



Performance

On plain PostgreSQL, Query 4 seems to be the most efficient.
I'll try to compare performance on a trial DC1.Large Redshift instance.



Source: https://stackoverflow.com/questions/45535038/redshift-update-or-insert-each-row-in-column-with-random-data-from-another-tabl
