SQL - not exists query with millions of records

Question

I'm trying to use the following SQL query (in SAS) to find any records from pool1 that do not exist in pool2. Pool1 has 11,000,000 records, pool2 has 700,000. This is where I run into an issue. I let the query run for 16 hours and it was nowhere near finishing. Is there a more efficient way (in SQL or SAS) to achieve what I need to do?

PROC SQL;
CREATE TABLE ALL AS
SELECT A.ID
FROM POOL1 A
WHERE NOT EXISTS (SELECT B.ID
                  FROM POOL2 B
                  WHERE B.ID = A.ID);
QUIT;

Answer 1:

PROC SQL;
CREATE TABLE ALL AS
SELECT A.ID
    FROM
        POOL1 A
    WHERE A.ID NOT IN (SELECT B.ID
                        FROM
                            POOL2 B);
QUIT;

This change should return the same result set, and it may run considerably faster: instead of correlating POOL2 back to each row of POOL1, it simply excludes the IDs that exist in POOL2.

As stated in another answer, an index may help, but if the ID fields are primary keys they are likely indexed already.

Answer 2:

Your query is fine. In most databases, not exists is the best way (or one of the best ways) to express this logic.

However, you need an index for performance:

create index idx_pool2_id on pool2(id);
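Since the question runs this in SAS, here is a rough sketch of the same index in PROC SQL. One caveat: SAS requires a simple (single-column) index to have the same name as the column it indexes, so the name idx_pool2_id above would likely be rejected there. The table and column names are taken from the question; everything else is an assumption about your setup.

```sas
/* Sketch only: assumes POOL2 is in the WORK library and has an ID column. */
PROC SQL;
  /* In SAS, a simple index must be named after the indexed column. */
  CREATE INDEX ID ON POOL2(ID);
QUIT;
```

Whether the optimizer actually uses the index for the NOT EXISTS subquery depends on the SAS version and the data, so it is worth timing the query on a subset first.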

Answer 3:

You're doing this in SAS, so why not use a data step?

data all;
  merge pool1(in = a) pool2(in = b keep = ID);
  by ID;
  if a and not(b);
run;

This requires that both datasets are either sorted or indexed by ID. If pool2 has multiple records per ID, I would suggest deduplicating it first and merging against the deduplicated copy (temp) instead:

proc sort data = pool2 out = temp nodupkey;
  by id;
run;
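Putting the pieces of this answer together, the whole flow might look roughly like the sketch below. The output dataset names pool1_sorted and pool2_ids are made up for illustration, and the first sort can be skipped if pool1 is already sorted or indexed by ID.

```sas
/* Sort the large table by ID (skip if already sorted or indexed by ID). */
proc sort data = pool1 out = pool1_sorted;
  by ID;
run;

/* Keep only the ID column of the small table, one row per ID. */
proc sort data = pool2(keep = ID) out = pool2_ids nodupkey;
  by ID;
run;

/* Keep the pool1 rows whose ID has no match in pool2. */
data all;
  merge pool1_sorted(in = a) pool2_ids(in = b);
  by ID;
  if a and not b;
run;
```

The merge step reads each dataset once in ID order, so after the sorts the cost is a single sequential pass rather than one lookup per pool1 row.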

Answer 4:

PROC SQL;
CREATE TABLE ALL AS
SELECT A.ID
  FROM POOL1 A
       LEFT JOIN POOL2 B ON B.ID = A.ID
 WHERE B.ID IS NULL;
QUIT;

Source: https://stackoverflow.com/questions/31430157/sql-not-exists-query-with-millions-of-records

Tags

sql

sas

not-exists