Question
I'm building my first de-identification script, and I'm running into issues with my approach.
I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data. Every row in this column of 200 rows has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.
What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.
To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200. Every time I do SELECT ABS(Checksum(NewId())) % 200 I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
I get VERY intermittent results. I'd say about 30% of the runs return one name picked out of the table (this is the expected result), about 30% come back with more than one row (which is baffling, since there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column).
I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.
Thoughts?
Answer 1:
Why your query in the question returns unexpected results
Your original query selects from Pseudonyms. The server scans through each row of the table, takes the ID from that row, generates a new random number, and compares the generated number to the ID.
When the number generated for a particular row happens to equal the ID of that row, the row is returned in the result set. It is quite possible that the generated number never matches any ID, and equally possible that it matches several times.
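To put rough numbers on this, here is a back-of-the-envelope estimate. It assumes that each row's random value is independent and uniform over 0..199 (the range of % 200), so the row with id 200 can never match and each of the other 199 rows matches with probability 1/200. Under that assumption, the number of returned rows N follows a binomial distribution:

```latex
\begin{aligned}
P(N = 0) &= \left(\tfrac{199}{200}\right)^{199} \approx 0.37 \\
P(N = 1) &= 199 \cdot \tfrac{1}{200} \cdot \left(\tfrac{199}{200}\right)^{198} \approx 0.37 \\
P(N \ge 2) &= 1 - P(N = 0) - P(N = 1) \approx 0.26
\end{aligned}
```

That is roughly in line with the "about 30% each" split reported in the question.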
A bit more detailed:
- Server picks a row with ID=1.
- Generates a random number, say 25. Why not? A decent random number.
- Is 1 = 25? No => This row is not returned.
- Server picks a row with ID=2.
- Generates a random number, say 125. Why not? A decent random number.
- Is 2 = 125? No => This row is not returned.
- And so on...
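For the single-row lookup in the question, one way to sidestep the per-row re-evaluation is to compute the random number once, store it, and filter on the stored value. The following is only a sketch: it assumes the ids are exactly 1..200 as described in the question, and the variable name @pick is made up for illustration.

```sql
-- Sketch only: evaluate the random expression once, before the query runs.
-- "% 200" yields 0..199, so "1 +" shifts the result onto the ids 1..200.
DECLARE @pick int = 1 + ABS(CHECKSUM(NEWID())) % 200;

-- The WHERE clause now compares every row against the same stored value,
-- so exactly one row matches as long as the ids 1..200 have no gaps.
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms AS pn
WHERE pn.id = @pick;
```

This covers picking one name; updating every row of Users with its own random pick needs a per-row random number, which is what the rest of this answer works through.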
Here is a complete solution on SQL Fiddle
Sample data
DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);
INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
    'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
    'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
Table Users has 1000 rows with the same UserName for each row. Table Pseudonyms has 200 rows with different PseudonymNames:
SELECT * FROM @VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM @VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
First attempt
At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:
SELECT
    U.ID
    ,U.UserName
    ,CA.PseudonymName
FROM
    @VarUsers AS U
    CROSS APPLY
    (
        SELECT TOP(1)
            P.PseudonymName
        FROM @VarPseudonyms AS P
        ORDER BY CRYPT_GEN_RANDOM(4)
    ) AS CA
;
It turns out that the optimizer is too smart: this produced a random PseudonymName, but the same one for every User, which is not what I expected:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
So, I tweaked this approach a bit and generated a random number for each row in Users first. Then I used the generated number to find the Pseudonym with that ID for each row in Users using CROSS APPLY.
CTE_Users has an extra column with a random number from 1 to 200. In CTE_Joined we pick a row from Pseudonyms for each User.
Finally we UPDATE the original Users table.
Final solution
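WITH
CTE_Users
AS
(
    SELECT
        U.ID
        ,U.UserName
        -- Map a random 32-bit int to [0, 1), then scale to [1, 201).
        -- Dividing by 2^32 (4294967296.0) rather than 4294967295.0 keeps the
        -- smallest int from producing a value below 1, which would truncate to 0.
        ,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967296.0 + 0.5) AS rnd
    FROM @VarUsers AS U
)
,CTE_Joined
AS
(
    SELECT
        CTE_Users.ID
        ,CTE_Users.UserName
        ,CA.PseudonymName
    FROM
        CTE_Users
        CROSS APPLY
        (
            SELECT P.PseudonymName
            FROM @VarPseudonyms AS P
            -- CAST truncates the fractional part, giving an ID from 1 to 200.
            WHERE P.ID = CAST(CTE_Users.rnd AS int)
        ) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;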
Results
SELECT * FROM @VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
Source: https://stackoverflow.com/questions/29760225/how-to-update-each-row-of-a-table-with-a-random-row-from-another-table