Question
update testdata.dataset1
set abcd = (select abc
            from dataset2
            order by random()
            limit 1);
Doing this populates every row of dataset1 with the same single random entry from dataset2. What I need is to populate each row of dataset1 with its own random entry from dataset2.
Note: dataset1 can be larger than dataset2.
Answer 1:
Query 1
You should reference dataset1's abcd column inside the subquery so that it becomes correlated and is re-evaluated for every row, instead of being "optimized" into a single execution.
UPDATE dataset1
SET abcd = (SELECT abc
            FROM dataset2
            WHERE abcd = abcd
            ORDER BY random()
            LIMIT 1);
Query 2
The query below should be faster on plain PostgreSQL.
UPDATE dataset1
SET abcd = (SELECT abc
            FROM dataset2
            WHERE abcd = abcd
            OFFSET floor(random() * (SELECT COUNT(*) FROM dataset2))
            LIMIT 1);
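The OFFSET expression relies on floor(random() * N) being uniform over 0 .. N-1. A quick Python sketch (an illustration with made-up numbers, not part of the original answer) checks that assumption:

```python
import math
import random

N = 5            # stands in for COUNT(*) FROM dataset2
TRIALS = 100_000
counts = [0] * N

for _ in range(TRIALS):
    # Same expression as the SQL: random() is uniform on [0, 1),
    # so floor(random() * N) is uniform on {0, ..., N-1}.
    offset = math.floor(random.random() * N)
    counts[offset] += 1

# Each offset should be hit roughly TRIALS / N times.
print(counts)
```

Every bucket lands near TRIALS / N, so each dataset2 row is equally likely to be picked by the OFFSET.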
However, as you have reported, that is not the case on Redshift, which uses columnar storage.
Query 3
Fetching all the records from dataset2 in a single query should be more efficient than fetching them one by one. Let's test:
UPDATE dataset1 original
SET abcd = fake.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
      FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;
Note that dataset1 must have an integer id column. Also, for dataset1.id values greater than the number of records in dataset2, the abcd values are predictable: ids that differ by a multiple of the dataset2 row count receive the same value.
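The modulo mapping and its predictability can be sketched in Python (a hypothetical illustration; the three-row dataset2 and seven-row dataset1 are made up):

```python
import random

# Stand-in for dataset2.abc, shuffled once -- this mirrors
# ROW_NUMBER() OVER (ORDER BY random()) assigning shuffled ids.
dataset2_abc = ["a", "b", "c"]
shuffled = dataset2_abc[:]
random.shuffle(shuffled)
n2 = len(shuffled)

# Each dataset1.id maps to shuffled[id % n2], so ids that differ by a
# multiple of n2 always receive the same value -- that is why the
# assignment is predictable once dataset1 is larger than dataset2.
dataset1_ids = range(7)
assignment = {i: shuffled[i % n2] for i in dataset1_ids}

assert assignment[0] == assignment[3] == assignment[6]
```
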
Query 4
Let's create an integer fake_id column in dataset1, prefill it with random values, and join on dataset1.fake_id = dataset2.id:
UPDATE dataset1
SET fake_id = floor(random() * (SELECT COUNT(*) FROM dataset2)) + 1;

UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;
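The two-step approach can be reproduced end to end in SQLite via Python's sqlite3 module (a minimal sketch with made-up data; SQLite's RANDOM() replaces PostgreSQL's random(), and the join is written as a correlated subquery so it also runs on SQLite builds older than 3.33, which lack UPDATE ... FROM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dataset2 (id INTEGER PRIMARY KEY, abc TEXT);
    INSERT INTO dataset2 (abc) VALUES ('x'), ('y'), ('z');
    CREATE TABLE dataset1 (id INTEGER PRIMARY KEY, abcd TEXT, fake_id INTEGER);
    INSERT INTO dataset1 (abcd) VALUES (NULL), (NULL), (NULL), (NULL), (NULL);
""")

# Step 1: prefill fake_id with a random dataset2 id in 1 .. COUNT(*).
# SQLite's RANDOM() returns a signed 64-bit integer, hence ABS()/modulo
# where PostgreSQL uses floor(random() * COUNT(*)) + 1.
cur.execute("""
    UPDATE dataset1
    SET fake_id = ABS(RANDOM()) % (SELECT COUNT(*) FROM dataset2) + 1
""")

# Step 2: copy abc across using the prefilled ids.
cur.execute("""
    UPDATE dataset1
    SET abcd = (SELECT abc FROM dataset2 WHERE dataset2.id = dataset1.fake_id)
""")
conn.commit()

rows = [r[0] for r in cur.execute("SELECT abcd FROM dataset1")]
print(rows)  # five values, each drawn (repeats allowed) from dataset2
```
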
Query 5
If you don't want to add a fake_id column to dataset1, calculate the fake_id values "on the fly":
UPDATE dataset1
SET abcd = abc
FROM (SELECT with_fake_id.id, dataset2.abc
      FROM (SELECT dataset1.id,
                   floor(RANDOM() * (SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
            FROM dataset1) AS with_fake_id
      JOIN dataset2 ON with_fake_id.fake_id = dataset2.id) AS joined
WHERE dataset1.id = joined.id;
Performance
On plain PostgreSQL, Query 4 seems to be the most efficient.
I'll try to compare performance on a trial DC1.Large instance.
Source: https://stackoverflow.com/questions/45535038/redshift-update-or-insert-each-row-in-column-with-random-data-from-another-tabl