Question
I have a table of users (username, gender, date_of_birth, zip). A user's id is permanent, but the same user may have registered many times in the past, sometimes filling in all the data and sometimes not. In addition, a user may have changed residence, in which case the zip changes.
So the query
SELECT username, sex, date_birth, zip FROM users_log WHERE username IN('user1', 'user2', 'user3')
returns the following result:
"user1";"M";"1982-10-04 00:00:00";"6320"
"user2";"";"";"1537"
"user3";"";"";"1537"
"user3";"";"";"1000"
"user3";"";"";"1000"
"user3";"";"1979-05-29 00:00:00";"1000"
"user3";"";"";"1537"
"user3";"";"1979-05-29 00:00:00";"1000"
"user1";"";"";"1000"
"user3";"";"";"1537"
In this case user1 has changed residence, so the zip code changed, and the second row that belongs to him contains no demographic data. User3 also has multiple records, only two of which contain demographic data.
What I would like to do is associate each user with the row that contains the most data about him, and take the zip from that row, i.e. the row with the most known values. Does anyone know how to write the appropriate query?
Thanks!
Answer 1:
It's gonna be painful; very painful.
Your question isn't clear about this issue, but I'm assuming that the 'user id' you're referring to is the user name. There are consequential modifications to make if that's wrong.
As with any complex query, build it up in stages.
Stage 1: How many non-null fields are there per record?
SELECT username, sex, date_of_birth, zip,
       CASE WHEN sex IS NULL THEN 0 ELSE 1 END +
       CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
       CASE WHEN zip IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
  FROM users_log
Stage 2: What is the maximum number of non-null fields for a given user name?
SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
  FROM (SELECT username, sex, date_of_birth, zip,
               CASE WHEN sex IS NULL THEN 0 ELSE 1 END +
               CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
               CASE WHEN zip IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
          FROM users_log
       ) AS u
 GROUP BY username
Stage 3: Select (all) the rows for a given user with that maximal number of non-null fields:
SELECT u.username, u.sex, u.date_of_birth, u.zip
  FROM (SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
          FROM (SELECT username, sex, date_of_birth, zip,
                       CASE WHEN sex IS NULL THEN 0 ELSE 1 END +
                       CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                       CASE WHEN zip IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
                  FROM users_log
               ) AS u
         GROUP BY username
       ) AS v
  JOIN (SELECT username, sex, date_of_birth, zip,
               CASE WHEN sex IS NULL THEN 0 ELSE 1 END +
               CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
               CASE WHEN zip IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
          FROM users_log
       ) AS u
    ON u.username = v.username AND u.num_non_null_fields = v.num_non_null_fields;
Now, if someone has multiple rows with (say) all three fields filled in, then all those rows will be returned. However, you've not specified any criteria by which to choose between those rows.
The basic techniques here can be adapted to any changed requirements. The key is to build and test the sub-queries as you go.
None of this SQL has been near a DBMS; there could be bugs in it.
You've not specified which DBMS you are using. However, it seems that Oracle won't like the AS notation used for table aliases, though it has no problem with AS on column aliases. If you're using any other DBMS, you shouldn't have to worry about that minor eccentricity.
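As a rough check of the three stages above, here is a minimal sketch that runs them against the sample rows from the question. Several things in it are my assumptions rather than part of the answer: it uses Python's sqlite3 module as a stand-in DBMS, it rewrites the repeated sub-query as a common table expression, it adds DISTINCT so that identical tied rows collapse to one, and it loads the empty values from the sample output as NULL. The function name most_complete_rows is hypothetical.

```python
import sqlite3

# Sample rows from the question; empty values are loaded as NULL (an assumption).
ROWS = [
    ("user1", "M", "1982-10-04 00:00:00", "6320"),
    ("user2", None, None, "1537"),
    ("user3", None, None, "1537"),
    ("user3", None, None, "1000"),
    ("user3", None, None, "1000"),
    ("user3", None, "1979-05-29 00:00:00", "1000"),
    ("user3", None, None, "1537"),
    ("user3", None, "1979-05-29 00:00:00", "1000"),
    ("user1", None, None, "1000"),
    ("user3", None, None, "1537"),
]

QUERY = """
WITH counted AS (          -- Stage 1: count non-null fields per row
    SELECT username, sex, date_of_birth, zip,
           CASE WHEN sex IS NULL THEN 0 ELSE 1 END +
           CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
           CASE WHEN zip IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
      FROM users_log
),
best AS (                  -- Stage 2: highest count per user name
    SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
      FROM counted
     GROUP BY username
)
-- Stage 3: keep the rows that reach the per-user maximum
SELECT DISTINCT c.username, c.sex, c.date_of_birth, c.zip
  FROM counted AS c
  JOIN best AS b
    ON c.username = b.username
   AND c.num_non_null_fields = b.num_non_null_fields
 ORDER BY c.username, c.zip
"""


def most_complete_rows(rows):
    """Load rows into an in-memory table and return the most complete row(s) per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_log (username TEXT, sex TEXT, date_of_birth TEXT, zip TEXT)"
    )
    conn.executemany("INSERT INTO users_log VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_complete_rows(ROWS):
        print(row)
```

If the empty values are really empty strings rather than NULL, each `sex IS NULL` style test would need to become something like `NULLIF(sex, '') IS NULL`. A user with several distinct rows tied at the maximum would still get all of them back, since no tie-breaking rule was given.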
Answer 2:
Luckily you are using PostgreSQL. It's easier to count fields that are filled by casting boolean to integer:
SELECT username,
       (
           (sex is not null)::int
         + (date_birth is not null)::int
         + (zip is not null)::int
       ) / 3.0 as percent_complete
  FROM users_log
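The query above only scores each row. One possible way to go from the score to a single row per user is to rank rows within each user name and keep the top one. The sketch below does that with ROW_NUMBER(), run through Python's sqlite3 module as a stand-in for PostgreSQL (this needs an SQLite build with window functions, 3.25 or later). It is an assumption-laden illustration: the `::int` cast is written as `CAST(... AS INTEGER)` so the same SQL is accepted by both systems, empty sample values are loaded as NULL, ties are broken arbitrarily because the question gives no rule, and the function name best_row_per_user is hypothetical.

```python
import sqlite3

# Sample rows from the question; empty values are loaded as NULL (an assumption).
ROWS = [
    ("user1", "M", "1982-10-04 00:00:00", "6320"),
    ("user2", None, None, "1537"),
    ("user3", None, None, "1537"),
    ("user3", None, None, "1000"),
    ("user3", None, None, "1000"),
    ("user3", None, "1979-05-29 00:00:00", "1000"),
    ("user3", None, None, "1537"),
    ("user3", None, "1979-05-29 00:00:00", "1000"),
    ("user1", None, None, "1000"),
    ("user3", None, None, "1537"),
]

QUERY = """
SELECT username, sex, date_birth, zip, percent_complete
  FROM (SELECT s.*,
               -- rank each user's rows, most complete first (ties are arbitrary)
               ROW_NUMBER() OVER (PARTITION BY username
                                  ORDER BY percent_complete DESC) AS rn
          FROM (SELECT username, sex, date_birth, zip,
                       ( CAST((sex IS NOT NULL) AS INTEGER)
                       + CAST((date_birth IS NOT NULL) AS INTEGER)
                       + CAST((zip IS NOT NULL) AS INTEGER)
                       ) / 3.0 AS percent_complete
                  FROM users_log
               ) AS s
       ) AS ranked
 WHERE rn = 1
 ORDER BY username
"""


def best_row_per_user(rows):
    """Return one row per user name: the one with the highest completeness score."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_log (username TEXT, sex TEXT, date_birth TEXT, zip TEXT)"
    )
    conn.executemany("INSERT INTO users_log VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in best_row_per_user(ROWS):
        print(row)
```

On PostgreSQL itself, `SELECT DISTINCT ON (username) ... ORDER BY username, percent_complete DESC` is another way to express the same idea. Despite its name, percent_complete is a fraction between 0 and 1.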
Your objective is similar to this problem:
Postgresql: Calculate rank by number of true OR clauses
Source: https://stackoverflow.com/questions/10242991/sql-how-to-select-the-row-with-most-known-values