SQL: how to select the row with most known values?

Posted by 你。 on 2019-12-11 08:24:41

Question


I have a table of users (username, gender, date_of_birth, zip) in which a user's id is permanent, but the same user may have registered many times in the past, sometimes filling in all the data and sometimes not. In addition, a user may have moved, in which case the zip changes.

So the query

SELECT username, sex, date_birth, zip FROM users_log WHERE username IN('user1', 'user2', 'user3')

returns the following result:

"user1";"M";"1982-10-04 00:00:00";"6320"
"user2";"";"";"1537"
"user3";"";"";"1537"
"user3";"";"";"1000"
"user3";"";"";"1000"
"user3";"";"1979-05-29 00:00:00";"1000"
"user3";"";"";"1537"
"user3";"";"1979-05-29 00:00:00";"1000"
"user1";"";"";"1000"
"user3";"";"";"1537"

Here user1 has moved, so the zip code changed, and the second row that 'belongs' to him contains no demographic data. user3 also has multiple records, only two of which contain demographic data.

What I would like to do is associate each user with the row that contains the most data about him, and take the zip from that row, i.e. the row with the most known values. Does anyone know how to write the appropriate query?

Thanks!


Answer 1:


It's gonna be painful; very painful.

Your question isn't clear about this issue, but I'm assuming that the 'user id' you're referring to is the user name. There are consequential modifications to make if that's wrong.

As with any complex query, build it up in stages.

Stage 1: How many non-null fields are there per record?

SELECT username, sex, date_of_birth, zip,
       CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
       CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
       CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
  FROM users_log
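
One caveat about Stage 1: the sample output in the question shows empty strings, which may be real NULLs rendered by the client or may be stored empty strings. If they are stored as '', IS NULL would count them as known values. A minimal sketch of one way to handle that, wrapping each column in NULLIF(col, ''); it runs against an in-memory SQLite database through Python's sqlite3 module purely as a stand-in for whatever DBMS is really in use, and the three sample rows are made up for illustration:

```python
import sqlite3

# In-memory SQLite database used only as a stand-in DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_log (username TEXT, sex TEXT, date_of_birth TEXT, zip TEXT)"
)
conn.executemany(
    "INSERT INTO users_log VALUES (?, ?, ?, ?)",
    [
        ("user1", "M", "1982-10-04", "6320"),  # fully filled in
        ("user1", "", "", "1000"),             # missing values stored as ''
        ("user2", None, None, "1537"),         # missing values stored as NULL
    ],
)

# Stage 1 with NULLIF: an empty string is turned into NULL before the test,
# so '' and NULL are both counted as "unknown".
STAGE1 = """
SELECT username, zip,
       CASE WHEN NULLIF(sex, '')           IS NULL THEN 0 ELSE 1 END +
       CASE WHEN NULLIF(date_of_birth, '') IS NULL THEN 0 ELSE 1 END +
       CASE WHEN NULLIF(zip, '')           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
  FROM users_log
 ORDER BY username, zip
"""

rows = conn.execute(STAGE1).fetchall()
for row in rows:
    print(row)
```

If the columns are not character types in the real table (a true date column, say), the NULLIF on that column would need adjusting or dropping.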

Stage 2: Which is the maximum such number of fields for a given user name?

SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
  FROM (SELECT username, sex, date_of_birth, zip,
               CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
               CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
               CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
          FROM users_log
       ) AS u
 GROUP BY username

Stage 3: Select (all) the rows for a given user with that maximal number of non-null fields:

SELECT u.username, u.sex, u.date_of_birth, u.zip
  FROM (SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
          FROM (SELECT username, sex, date_of_birth, zip,
                       CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                       CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                       CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
                  FROM users_log
               ) AS u
         GROUP BY username
       ) AS v
  JOIN (SELECT username, sex, date_of_birth, zip,
               CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
               CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
               CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
          FROM users_log
       ) AS u
    ON u.username = v.username AND u.num_non_null_fields = v.num_non_null_fields;

Now, if someone has multiple rows with (say) all three fields filled in, then all those rows will be returned. However, you've not specified any criteria by which to choose between those rows.
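
If exactly one row per user is wanted, one possible approach is to rank each user's rows with ROW_NUMBER() and keep the first. This is a sketch under assumptions, not something the question asked for: the tie-breaker (lowest zip) is arbitrary and purely hypothetical, the DBMS must support window functions (SQLite needs 3.25 or later), and the in-memory SQLite database driven from Python's sqlite3 module is just a stand-in with a handful of rows modelled on the question's output:

```python
import sqlite3

# In-memory SQLite database used only as a stand-in DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_log (username TEXT, sex TEXT, date_of_birth TEXT, zip TEXT)"
)
conn.executemany(
    "INSERT INTO users_log VALUES (?, ?, ?, ?)",
    [
        ("user1", "M", "1982-10-04", "6320"),
        ("user1", None, None, "1000"),
        ("user2", None, None, "1537"),
        ("user3", None, None, "1537"),
        ("user3", None, "1979-05-29", "1000"),
        ("user3", None, "1979-05-29", "1000"),
    ],
)

# Rank rows within each username by the number of non-null fields (highest
# first); rows with the same count are ordered by zip, a hypothetical
# tie-breaker chosen only to make the result deterministic.
ONE_ROW_PER_USER = """
SELECT username, sex, date_of_birth, zip
  FROM (SELECT username, sex, date_of_birth, zip,
               ROW_NUMBER() OVER (
                   PARTITION BY username
                   ORDER BY CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                            CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                            CASE WHEN zip           IS NULL THEN 0 ELSE 1 END DESC,
                            zip
               ) AS rn
          FROM users_log
       ) AS ranked
 WHERE rn = 1
 ORDER BY username
"""

best = conn.execute(ONE_ROW_PER_USER).fetchall()
for row in best:
    print(row)
```

A more meaningful tie-breaker, such as a registration timestamp if the table has one, would go in place of zip in the ORDER BY.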

The basic techniques here can be adapted to any changed requirements. The key is to build and test the sub-queries as you go.

None of this SQL has been near a DBMS; there could be bugs in it.

You've not specified which DBMS you are using. However, it seems that Oracle won't like the AS notation used for table aliases, though it has no problem with AS on column aliases. If you're using any other DBMS, you shouldn't have to worry about that minor eccentricity.




Answer 2:


Luckily you are using PostgreSQL. It's easier to count fields that are filled by casting boolean to integer:

SELECT username, 
   ( 
      (sex is not null)::int 
    + (date_birth is not null)::int
    + (zip is not null)::int
   ) / 3.0 as percent_complete
FROM users_log

Your objective is similar to this problem:
Postgresql: Calculate rank by number of true OR clauses



Source: https://stackoverflow.com/questions/10242991/sql-how-to-select-the-row-with-most-known-values
