PostgreSQL - Select only 1 row for each ID

问题

SITUATION

I am working on a travel engine site and am writing a complex query to match visitors' search queries with their bookings based on IP Address, Destination and Date so I can work out the conversion ratio later.

PROBLEM

There needs to be multiple conversion ratios based on a parameter (In this case, the utm_source which I extract from a RequestUrl stored in the searches table). The problem is some users make multiple searches from different locations.. sometimes we get the utm_source in the request and sometimes not... and of course we need to match to only 1 booking. See screenshot of query result below to better understand:

See the 3rd and 4th rows have the same booking ID, etc.. but different values for the Value column. I need to select only 1 of these, but not both. Basically, if there is more than 1, I need to choose the 1 that is not "N/A".

MY QUERY:

SELECT DISTINCT "B"."Id" AS "BookingId", "PQ"."IPAddress", "PQ"."To", "PQ"."SearchDate", "PQ"."Value"
FROM
(
    SELECT DISTINCT "IPAddress", "To", "CreatedAt"::date AS "SearchDate", COALESCE(SUBSTRING("RequestUrl", 'utm_source=([^&]*)'), 'N/A') AS "Value"
    FROM dbo."PackageQueries"
    WHERE "SiteId" = '<The ID>'
    AND "CreatedAt" >= '<Start Date>'
    AND "CreatedAt" < '<End Date>'
) AS "PQ"
INNER JOIN dbo."Bookings" AS "B"
    ON "PQ"."IPAddress" = "B"."IPAddress"
    AND "B"."To" = "PQ"."To"
    AND "B"."BookingDate"::date = "PQ"."SearchDate"
WHERE "B"."SiteId" = '<The ID>'
AND "B"."BookingStatus" = 2
AND "B"."BookingDate" >= '<Start Date>'
AND "B"."BookingDate" < '<End Date>'
ORDER BY "B"."Id", "PQ"."IPAddress", "PQ"."To";

回答1:

I found a solution and based it on what I found here: Return rows that are max of one column in Postgresql and here: Postgres CASE in ORDER BY using an alias

My solution is as follows:

SELECT "BookingId", "IPAddress", "To", "SearchDate", "Value"
FROM
(
    SELECT DISTINCT
        "B"."Id" AS "BookingId",
        "PQ"."IPAddress",
        "PQ"."To",
        "PQ"."SearchDate",
        "PQ"."Value",
        RANK() OVER
        (
            PARTITION BY "B"."Id"
            ORDER BY
            CASE
                WHEN "PQ"."Value" = 'N/A' THEN 1
                ELSE 0
            END
        ) AS "RowNumber"
    FROM
    (
        SELECT DISTINCT "IPAddress", "To", "CreatedAt"::date AS "SearchDate", COALESCE(SUBSTRING("RequestUrl", 'utm_source=([^&]*)'), 'N/A') AS "Value"
        FROM dbo."PackageQueries"
        WHERE "SiteId" = '<Site ID>'
        AND "CreatedAt" >= '<Start Date>'
        AND "CreatedAt" < '<End Date>'
    ) AS "PQ"
    INNER JOIN dbo."Bookings" AS "B"
        ON "PQ"."IPAddress" = "B"."IPAddress"
        AND "B"."To" = "PQ"."To"
        AND "B"."BookingDate"::date = "PQ"."SearchDate"
    WHERE "B"."SiteId" = '<Site ID>'
    AND "B"."BookingStatus" = 2
    AND "B"."BookingDate" >= '<Start Date>'
    AND "B"."BookingDate" < '<End Date>'
) T
WHERE "RowNumber" = 1
ORDER BY "BookingId", "IPAddress", "To";

Somewhat long-winded, but it does the trick nicely. I hope it helps others.

EDIT

This wasn't the end of the story: there were still some cases in which I was getting more than 1 value. The answer was to modify the CASE statement to generate a unique number for each text value. The solution can be found here: PostgreSQL - Assign integer value to string in case statement

来源：https://stackoverflow.com/questions/42129367/postgresql-select-only-1-row-for-each-id

标签

sql

postgresql

greatest-n-per-group