SQL - not exists query with millions of records

Posted by 给你一囗甜甜゛ on 2019-12-06 13:30:53
Steve Matthews
PROC SQL;
CREATE TABLE ALL AS
SELECT A.ID
    FROM POOL1 A
    WHERE A.ID NOT IN (SELECT B.ID
                         FROM POOL2 B);
QUIT;

The change above should return the same result set but take considerably less time to run, because you are no longer joining POOL2 back to POOL1 but simply excluding the IDs that exist in POOL2.

As stated in another answer, an index may help, but if the ID fields are primary keys they are likely already indexed.

Your query is fine. In most databases, not exists is the best way (or one of the best ways) to express this logic.

However, you need an index for performance:

create index idx_pool2_id on pool2(id);
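One caveat if you run this inside SAS rather than a regular database: PROC SQL expects a simple (single-column) index to have the same name as its column, so the statement above would need a small change. The sketch below shows that form together with a NOT EXISTS query that could use the index. The question's original query is not reproduced here, so the query shape and the names POOL1, POOL2, ID and ALL are assumed from the other answers.

```sas
proc sql;
  /* In SAS, a simple index must be named after the column it covers. */
  create index id on pool2(id);

  /* Assumed shape of the NOT EXISTS anti-join: keep the POOL1 IDs
     that have no matching row in POOL2. */
  create table all as
  select a.id
    from pool1 a
   where not exists (select 1
                       from pool2 b
                      where b.id = a.id);
quit;
```

Whether the optimizer actually uses the index depends on the data and the SAS version, so treat this as something to benchmark rather than a guaranteed speed-up.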

You're doing this in SAS, so why not use a data step?

data all;
  merge pool1(in = a) pool2(in = b keep = ID);
  by ID;
  if a and not(b);
run;

This requires that both datasets are either sorted or indexed by ID. If POOL2 has multiple records per ID, I would suggest deduplicating it first:

proc sort data = pool2 out = temp nodupkey;
  by id;
run;
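Putting the pieces together, the merge could then read from the deduplicated copy. This is only a sketch: it assumes POOL1 is not already sorted by ID, and POOL1_SORTED is a hypothetical dataset name used to leave the original untouched.

```sas
/* Sort POOL1 by ID into a new dataset (POOL1_SORTED is a made-up name). */
proc sort data = pool1 out = pool1_sorted;
  by id;
run;

/* TEMP is the deduplicated copy of POOL2 created by the PROC SORT above.
   Keep only the rows that come from POOL1 and have no match in TEMP. */
data all;
  merge pool1_sorted(in = a) temp(in = b keep = id);
  by id;
  if a and not b;
run;
```

If POOL1 is already sorted or indexed by ID, the first step can be skipped and the merge can read POOL1 directly.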

Another option is a LEFT JOIN that keeps only the rows with no match in POOL2:

PROC SQL;
CREATE TABLE ALL AS
SELECT A.ID
  FROM POOL1 A
       LEFT JOIN POOL2 B ON B.ID = A.ID
 WHERE B.ID IS NULL;
QUIT;